Anomaly Detection for DiscreteSequences: A Survey

This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous.

1.IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 823 Anomaly Detection for Discrete Sequences: A Survey Varun Chandola, Arindam Banerjee, Member, IEEE, and Vipin Kumar, Fellow, IEEE Abstract—This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous. We show how each of these problem formulations is characteristically distinct from each other and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection. Index Terms—Discrete sequences, anomaly detection. Ç 1 INTRODUCTION S EQUENCE data are found in a wide variety of application domains such as intrusion detection, bioinformatics, weather prediction, system health management, etc. Anom- most important form of sequences encountered in real life applications. In this survey, we will focus on discrete sequences. Discrete or symbolic sequences are ordered sets aly detection for sequence data is an important topic of of events such that the events are symbols belonging to a research as it often leads to detection of actionable and finite alphabet. For example, a text document is a sequence of critical information such as intrusions, faults, and system words, a computer program is executed as a sequence of failures. There is extensive work on anomaly detection system calls, and a gene is a sequence of nucleic acids. techniques [1], [2], [3] that look for individual objects that Anomaly detection for discrete sequences has been a are different from normal objects. These techniques, focus of many research papers. But most of the anomaly referred to as point-based anomaly detection [1], do not take detection techniques have been developed within specific the sequence structure of the data into consideration. For application domains and have been evaluated on specific example, consider the set of user command sequences validation data sets. In particular, several techniques have shown in Table 1. While sequences S1 -S4 represent normal been developed to detect anomalies in operating system call daily profiles of a user, the sequence S5 is possibly an data [4], [5], [6], [7], [8], [9], [10]. Budalakoti et al. [11], [12] attempt to break into a computer by trying different present a study of anomaly detection techniques to detect passwords. Though the sequence S5 is anomalous, each anomalies in the flight safety domain. Sun et al. [13] present command in the sequence by itself is normal. a probabilistic suffix tree-based technique and evaluate it on A sequence is an ordered series of events. Sequences can biological sequences. Note that the nature of the sequence be of different types, such as binary, discrete, and continuous, data as well as the nature of anomalies can differ depending on the type of events that form the sequences. fundamentally across domains, and hence if a technique is Discrete and continuous sequences (or time series) are two shown to be effective within one domain, it does not follow that it will be effective in a different domain. Even though the existing techniques appear to have the . V. Chandola is with the Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37830. E-mail: same objective, i.e., to detect anomalies in discrete . A. Banerjee and V. Kumar are with the Department of Computer Science sequences, a deeper analysis reveals that different techni- and Engineering, University of Minnesota, Minneapolis, MN 55455. ques actually address different problem formulations. Most E-mail: {banerjee, kumar} of the existing research focuses on one of the following Manuscript received 1 May 2009; revised 10 Dec. 2009; accepted 26 July three problem formulations: 2010; published online 18 Nov. 2010. Recommended for acceptance by C. Clifton. 1. Sequence-based anomaly detection—Detecting For information on obtaining reprints of this article, please send e-mail to:, and reference IEEECS Log Number TKDE-2009-05-0399. anomalous sequences from a database of test Digital Object Identifier no. 10.1109/TKDE.2010.235. sequences. 1041-4347/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society

2.824 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 TABLE 1 sequence anomaly detection problem and how techniques Sequences of User Commands proposed for different domains relate to each other. Our specific contributions are: . We identify three distinct ways in which the anomaly detection problem has been formulated for discrete sequences and review techniques from many disparate and disconnected application do- mains that address each of these problem formula- tions. Such a categorization helps us identify the strengths and limitations of different techniques and 2.Contiguous subsequence-based anomaly detec- could assist researchers in understanding how tion—Detecting anomalous contiguous subse- techniques that address one formulation can be quences within a long sequence. adapted to solve a different formulation. 3. Pattern frequency-based anomaly detection—De- . Within each problem formulation, we group techni- tecting patterns in a test sequence with anomalous ques into categories based on the nature of the frequency of occurrence. underlying algorithm. For each category, we provide These formulations are fundamentally distinct, and a basic anomaly detection technique, and show how hence require exclusive solutions. Even for a single problem the existing techniques are variants of the basic formulation, different techniques have been proposed that technique. This approach shows how different utilize distinct approaches to detect anomalies. Fig. 1 shows techniques within a category are related or different the categorization of existing anomaly detection research from each other. Our categorization reveals new for discrete sequences. variants and combinations that have not been To the best of our knowledge, there is no other study that investigated before for anomaly detection. We also provides a global understanding of the existing research provide a discussion of relative strengths and along the lines of Fig. 1. One reason for lack of such a study weaknesses of different techniques. is because the literature on anomaly detection for sequences . We show how techniques developed for one problem is scattered across varied application domains and hence formulation can be adapted to solve a different most of the studies have been domain oriented. For example, formulation; thereby providing several novel adap- a comparison of four different anomaly detection techniques tations to solve the different problem formulations. for discrete sequences was presented by Forrest et al. [4], but . We highlight the applicability of the techniques that all of the techniques focused on system call intrusion handle discrete sequences to other related areas such as online anomaly detection and time series detection. Another reason is that techniques solve different anomaly detection. problem formulations. For example, Chandola et al. [14] provided a comparative evaluation of eight anomaly This survey can be utilized in three ways. First, it provides a detection techniques on sequence data collected from well structured overview of the research landscape. Second, different domains, but only focused on the sequence-based it allows researchers in one application domain to under- problem formulation. stand the existing work done in a different domain. Third, it allows researchers focusing on a particular problem for- 1.1 Our Contributions mulation to see how techniques developed for a different This survey attempts to provide a comprehensive and formulation can be applied to their problems. structured overview of the existing research for the problem Note that the three formulations covered in this survey of detecting anomalies in discrete/symbolic sequences. The do not cover all possible ways in which the anomaly objective is to provide a global understanding of the detection problem can be posed for discrete sequences. Fig. 1. Overview of existing anomaly detection research for discrete sequences.

3.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 825 However, most of the existing work can be covered under 1. Sequence of operating system calls/user commands [5], these three formulations. [16], [17], [18]. The sequences are defined using an alphabet of all possible system calls ($160 for Sun 1.2 Terminology Solaris operating system) or user commands We use the following terminology and notation in this paper: ($2;500 for a typical UNIX user). Anomalies in such data sets correspond to anomalous program . A discrete/symbolic sequence is defined as a finite behavior, often caused due to “break-ins” in the sequence of events, such that each event can be computer system by a virus or a malicious user, or represented as a symbol belonging to a finite unauthorized access of a computer system. Most of alphabet. The symbols may belong to a simple the techniques in this domain solve the sequence- alphabet, e.g., events in DNA sequences belong to a based (e.g., identifying anomalous user profiles) or four-letter alphabet—{A; C; G; T }, or to a complex subsequence-based (e.g., online monitoring of alphabet, e.g., the alphabet for system call sequences system usage) problem formulations. (see Table 1) is the set of system calls executed by a 2. Biological sequences such as DNA sequences, protein particular computer system. For simplicity, we will sequences [13], [15]. The sequences are defined using use sequence to refer to a discrete sequence and an alphabet that corresponds to either four nucleic symbol to refer to an element of the sequence. Sets of acid bases or close to 20 amino acid bases. Anomalies sequences will be denoted using bold capital letters, in such data sets correspond to either mutations, or a e.g., S; T. Individual sequences will be denoted condition of disease in the biological organism [19]. using bold small letters, e.g., s; t. Length of a Techniques in this domain either solve sequence- sequence s will be denoted by ls . A sequence within based (e.g., identifying anomalous protein sequences a set will be denoted as si 2 S; tj 2 T. A symbol [19]) or pattern frequency-based (e.g., identification within a sequence s will be denoted as si . of mutations) problem formulations. . A string is defined as an ordered sequence of 3. Sensor data from an operational system (such as aircraft symbols and hence is equivalent to a sequence in or space shuttle) [11], [12], [20]. The sequences are this context. We will use the terms strings and collected during the operation of the system through sequences interchangeably. multiple discrete sensors. The alphabet size for such . A subsequence of a sequence is defined as a smaller discrete sequences is typically large ($1;000 for sequence that can be derived from the original aircraft data [12]). Anomalies in such data sets sequence by deleting some symbols without chan- correspond to fault scenarios in the operation of ging the order of the remaining symbols [15]. the system or accidents. Techniques in this domain Formally, a sequence tð¼ t1 t2 . . .Þ is a subsequence typically solve the sequence-based problem formu- of sð¼ s1 s2 . . .Þ if there exists a set of integers lation to identify faulty operational runs. ði1 < i2 < . . .Þ, such that si1 ¼ t1 ; si2 ¼ t2 . . . . 4. Sequences of transactions from online banking, customer . A substring is defined as a contiguous subsequence purchases, and other retail commerce domains [21]. The of a string or a sequence. In some papers, the term “symbols” in such sequences are “actions” by segment is used instead of a substring. We denote a customers and the alphabet size for such sequences substring of a sequence s as si:j ; i j which starts at can also be large. Anomalies in such data sets ith location and ends at the jth location of the correspond to irregular or abnormal behavior by sequence s. customers. Techniques in this domain typically solve 1.3 Organization the subsequence-based problem formulation to identify anomalous customer behavior. The rest of this survey paper is organized as follows: 5. Navigational click sequences from websites [22], [23]. Section 2 describes different application domains where The “symbols” in such sequences correspond to anomaly detection techniques for discrete sequences have clicked links (websites), or categories to which the been developed. Section 3 describes the three different links belong. The anomalies in such data correspond problem formulations. Sections 4, 5, and 6 provide an to irregular or unauthorized user behavior. Techni- overview of techniques that have been proposed to solve ques in this domain also solve the subsequence- each one of these problem formulations. Section 7 discusses based problem formulation. how the different anomaly detection techniques discussed in the survey are applicable in the context of online anomaly 3 PROBLEM FORMULATIONS detection. Conclusions and future directions of research are The three formulations of the sequence anomaly detection presented in Section 8. problem, i.e., sequence-based, contiguous subsequence- based, and pattern frequency-based anomaly detection, 2 APPLICATIONS OF ANOMALY DETECTION FOR are unique in terms of how the anomalies are defined. For DISCRETE SEQUENCES the first formulation, an entire sequence is anomalous if it is significantly different from normal sequences. For the A few application domains where anomaly detection second formulation, a contiguous subsequence within a techniques have been developed to handle discrete long sequence is anomalous if it is significantly different sequences are as follows: from other subsequences in the same sequence. For the

4.826 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 third formulation, a given test pattern is anomalous if its . Semisupervised anomaly detection. A reference (or frequency of occurrence in a test sequence is significantly training) database is assumed, containing only different from its expected frequency in a normal sequence. normal sequences, and a test sequence is tested From the application perspective, the choice of the against the normal reference database. Formally, this problem formulation depends on the relevance of anoma- variant can be stated as lies to the requirement of the application domain. For example, consider the following three scenarios that can arise in the domain of operating system intrusion detection: Definition 1. Given a set of n sequences, S ¼ fs1 ; s2 ; . . . ; sn g, and a sequence t belonging to a test data set T, assign an .Scenario 1: Sequence-based anomaly detection. A anomaly score to t with respect to the training sequences in S. security analyst is interested in detecting “illegal” user sessions on a computer belonging to a corporate The length of sequences in S and the length of sequences network. An illegal user session is caused when an in T might not be equal. unauthorized person uses the computer with mal- icious intent. To detect such intrusions, the analyst . Unsupervised anomaly detection. The task is to detect can use the first formulation, in which the past anomalous sequences from an unlabeled database of normal user sessions (sequence of system calls/ sequences (unsupervised anomaly detection). Formally, commands) are used as the training data, and a new this variant can be stated as user session (a test sequence) is tested for anomalies against these training data. Definition 2. Given a set of n sequences, S ¼ fs1 ; s2 ; . . . ; sn g, . Scenario 2: Subsequence-based anomaly detection. assign an anomaly score to each sequence in S with respect to A security analyst is interested in detecting if a user’s the rest of S. account was misused (hacked) at some point during a given day. To detect this misuse, the analyst can use In this section, we will primarily discuss techniques that the second formulation, in which the user’s day long handle the semisupervised variant. A semisupervised activity is considered as a long sequence, and is technique can be adapted to solve the unsupervised tested for any (contiguous) anomalous subsequence. problem by treating the entire data set as a training set, . Scenario 3: Pattern frequency-based anomaly detec- and then scoring each sequence with respect to this training tion. A security analyst is interested in determining set. Such adaptations assume that the given set contains few if the frequency with which a user executed a anomalous sequences, and hence semisupervised technique particular sequence of commands is higher (or does not get affected by the presence of anomalies in the lower) than an expected frequency. Going back to training set. the example given in Table 1, the sequence login, Anomaly detection techniques in this section can be passwd,login,passwd corresponds to a failed login classified based on the unit element of a test sequence that is attempt followed by a successful login attempt. analyzed by the technique as follows: Occurrence of this sequence in a user’s daily profile is normal if it occurs occasionally, but is anomalous . Similarity-based techniques. These techniques treat if it occurs very frequently, since it could correspond the entire test sequence as a unit element in the to an unauthorized user surreptitiously attempting analysis, and hence are analogous to point-based an entry into the user’s computer by trying multiple anomaly detection techniques. They typically apply passwords. To detect such intrusions, the analyst can a proximity-based point anomaly detection techni- use the third formulation, in which the sequence of que by defining an appropriate similarity measure commands is the query pattern, and the frequency of for the sequences. the query pattern in the user sequence for the given . Window-based techniques. These techniques analyze a day is compared against the expected frequency of short window of symbols—a short substring— the query pattern in the daily sequences for the user within the test sequence at a time. Thus, such in the past, to detect anomalous behavior. techniques treat a substring within the test sequence The next three sections discuss various techniques that as a unit element for analysis. These techniques address these three problem formulations. Most techniques require an additional step in which the anomalous discussed in this survey assign a score to a sequence, nature of the entire test sequence is determined, contiguous subsequence, or pattern, indicating the magni- based on the analysis on the substrings within the tude with which the entity is considered to be anomalous or entire sequence. normal by the given technique. This score can be used to . Markovian techniques. These techniques predict the construct a ranked list of anomalies, or, with the help of a probability of observing each symbol of the test threshold, converted into a binary label of normal or sequence, using a probabilistic model, and use the anomalous. Some techniques directly assign a binary label per-symbol probabilities to obtain an anomaly score or normal or anomalous to the individual entities. for the test sequence. These techniques analyze each symbol with respect to previous few symbols. . Hidden Markov model (HMM)-based techniques. These 4 SEQUENCE-BASED ANOMALY DETECTION techniques transform the input sequences into Two variants of sequence-based anomaly detection for- sequences of hidden states, and then detect anoma- mulation exist: lies in the transformed sequences.

5.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 827 A detailed discussion of the above four categories of the LCS have been proposed [32]. A disadvantage of both techniques is given in the next four sections. SMC and nLCS is that they do not incorporate the relation between different pairs of symbols when computing the 4.1 Similarity-Based Techniques similarity, i.e., all matches and mismatches are treated These techniques compute pairwise similarity between equally, though alternative measures such as edit distance sequences using a specific similarity (or distance) measure can be employed to handle this issue. and then make use of a traditional proximity-based point- Other similarity measures that do not require the based anomaly detection algorithm [11], [12], [14]. sequences to be of equal length can also be used instead A basic similarity-based technique operates as follows: of nLCS. One such measure was proposed by Kumar et al. the similarity of the test sequence, t, to each sequence in the [33] by converting the sequences into bitmaps and compar- training set, S, is computed. The similarities are then ing the bitmaps to determine the similarity. Liu et al. [19] combined, e.g., inverse of average similarity, to obtain an used edit distance to determine anomalous protein se- anomaly score for t. quences from a database of sequences corresponding to Several variants of the basic technique have been different organisms. proposed which use different point-based algorithms and Advantages and disadvantages of similarity-based different similarity measures to compare a pair of sequences. techniques. The advantage of similarity-based techniques 4.1.1 Using Different Point-Based Anomaly Detection is that one can use any existing or new similarity measure for sequences [15], [18], [34] and any proximity-based point Algorithms anomaly detection technique [25], [35], and hence can Two point-based anomaly detection algorithms that have devise a unique anomaly detection which is best suited for a been used for discrete sequences are k-nearest neighbor given problem. (kNN) based [24] and clustering based [25]. Chandola et al. A disadvantage of similarity-based techniques is that [14] proposed a kNN-based technique in which the their performance is highly dependent on the choice of the anomaly score of a test sequence is equal to the dissim- similarity measure. Similarity measures such as nLCS are ilarity1 to its kth nearest neighbor in the training data set S. able to distinguish between normal and anomalous test Budalakoti et al. [11], [12] proposed a clustering-based sequences only when anomalous sequences are signifi- technique in which the training sequences are first clustered cantly different from the training sequences. If the into a fixed number of clusters using the k-medoid anomalies are localized in the test sequence, similarity algorithm [26]. The anomaly score for a test sequence is measures such as nLCS may fail to identify them. Another then computed as equal to the inverse of its similarity to its disadvantage of similarity-based techniques is the Oðn2 Þ closest medoid. Probabilistic clustering techniques, which complexity involved in computing pairwise similarity do not require an explicit similarity matrix to find clusters between sequences in S. in the training set, have also been used for anomaly detection. For example, Yang and Wang [27] proposed a 4.2 Window-Based Techniques technique that uses mixtures of Probabilistic Suffix Trees [28] Window-based techniques extract fixed-length overlapping as cluster representations. Other probabilistic techniques windows from a test sequence. Each window is assigned an such as mixture of HMMs [29], [30] or mixture of Maximum anomaly score. The anomaly scores of all windows within a Entropy (maxent) models [23] can also be employed in the test sequence are aggregated to obtain an anomaly score for same manner for clustering-based anomaly detection. the entire test sequence. These techniques are particularly useful when the cause of anomaly can be localized to one or 4.1.2 Using Different Similarity Measures more shorter substring within the actual sequence [16]. If The simplest similarity measure for comparing a pair of the entire sequence is analyzed as a whole, the anomaly discrete sequences is the Simple Matching Coefficient (SMC) signal might not be distinguishable (as in similarity-based [31] which counts the number of positions in which the two techniques) from the inherent variation that exists across sequences match as its similarity. While computing the sequences. By analyzing a short window at a time, window- similarity is fast (linear in the length of the sequences), this based techniques try to localize the cause of anomaly within measure has a drawback that it requires the sequences to be one or a few windows. of equal lengths. The standard technique to obtain short windows from a Several anomaly detection techniques use the length of sequence is to slide a fixed-length window, one symbol at a the longest common subsequence as a similarity measure time, along the sequence. Let us assume that for a given since it can compute similarity between two sequences of sequence s, the extracted windows are denoted by unequal lengths [11], [12], [14]. This similarity (nLCS) !1 ; !2 . . . !t and each window !i can also be denoted as between two sequences si and sj , is computed as !i1 !i2 . . . !ik . jLCSðsi ; sj Þj To further understand the motivation behind the win- nLCSðsi ; sj Þ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi ; ð1Þ dow-based techniques, let us assume that an anomalous test jsi ksj j sequence t contains a substring t0 (of a certain length), which where LCSðsi ; sj Þ is the longest common subsequence is the actual cause of anomaly. In a sliding window-based shared by si and sj . technique, if the length of the window is k, the anomalous The disadvantage of using nLCS is the computational substring t0 will occur (partly or whole) in jt0 j þ k À 1 complexity involved, though faster algorithms to compute windows. Thus, the anomalous sequence can be potentially detected by detecting at least one of such windows. On the 1. An inverse function of similarity. other hand, if jt0 j ( jtj, a similarity-based technique might

6.828 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 not be able to resolve the difference between anomalous and the same paper, another technique to compute Að!i Þ is normal test sequences. presented. Að!i Þ is 1 if the window !i is not present in the A basic window-based technique operates as follows: in normal dictionary, and 0 if the window is present in the the training phase, k-length sliding windows are extracted normal dictionary. Hamming Distance has also been used in from all training sequences and the frequency of occurrence [40], [41]. The latter approach has been adopted by [36], of each unique window in the training data set is [38], [39], [42]. maintained (as a normal dictionary). In the test step, sliding Lane and Brodley [18], [34], [43] use the following windows of length k are extracted from the test sequence, t. similarity measure to compute similarity Simð!i ; #j Þ be- A window !i is assigned a likelihood score Lð!i Þ which is tween a test window, !i and a window in the normal equal to the frequency associated with the window !i in the dictionary, #j : normal dictionary. A threshold  is used to determine if a 8 particular window !i is anomalous (Lð!i Þ < ) or not < 0; if i ¼ 0; (Lð!i Þ ! ). The anomaly score of the test sequence is wð!i ; #j ; lÞ ¼ or !il 6¼ #jl ; ð3Þ : proportional to the number of anomalous windows in the 1 þ wð!i ; #j ; l À 1Þ; if !il ¼ #jl ; test sequence. The exact expression for the anomaly score of and a test sequence, t is given by X k ji : Lð!i Þ < ; 1 i jtjj Simð!i ; #j Þ ¼ wð!i ; #j ; lÞ; ð4Þ AðtÞ ¼ : ð2Þ jtj l¼0 The above-mentioned basic window-based technique has where k is the length of the windows. In other words, if the been proposed for operating system call intrusion detection two windows match at a location l and the two windows and is termed as threshold-based sequence time delay also matched at previous location, the total similarity will embedding (t-STIDE) [4], [5]. be incremented more than if the two windows did not Several variants of the t-STIDE technique have been match at previous location, in which case the increment will proposed, especially for system call intrusion detection [16], be of 1. Thus, in the above similarity measure, a series of [18], [34], [36], [37], [38], [39], [40]. These variants differ from consecutive matching symbols can accumulate a large t-STIDE in terms of how they assign an anomaly score to a similarity, while even a single mismatch in the middle can window, and how they combine the scores to obtain a greatly reduce the similarity. global anomaly score for the test sequence. The anomaly score assigned to !i is equal to the inverse of maximum similarity between !i and windows in the 4.2.1 Assigning Anomaly Scores to Windows normal dictionary. Lane’s PhD thesis [44] discusses other In the t-STIDE technique, the anomaly score for each variants of the above described similarity measure in the window !i (denoted as Að!i Þ) is equal to the inverse of context of anomaly detection. frequency of occurrence of the window in the normal As mentioned earlier, the dictionaries are typically built dictionary. Other techniques have been proposed to assign for normal behavior. Certain techniques construct diction- a different anomaly score to a window. We will discuss aries for only the anomalous behavior [42], [45], [46], [47], three types of such techniques here. [48]. For example, Dasgupta and Nino [42] first build a 1. Using lookahead pairs. Techniques under this normal dictionary of windows. They then generate random category make use of lookahead pairs to assign a score to a windows from the given alphabet. Any window that does window, !i [16]. A lookahead pair is defined as h ; ij , such not match a normal window is treated as anomalous that the symbol occurs in the jth location after the symbol window and added to the anomaly dictionary. A compara- in at least one of the windows in the normal dictionary. tive study of using normal and anomaly dictionaries is Since the windows are of length k, j lies between 1 and presented in [49]. k À 1. All such lookahead pairs are added to a secondary 3. Using a classifier. Instead of using the frequency of normal dictionary. During testing, for each window !i , all occurrence of a window to compute its anomaly score, some potential lookahead pairs are extracted. For a length techniques use a classifier to assign an anomaly label to k window, there can be kðkÀ1Þ2 such potential pairs. The each window. The training involves learning a classifier number of potential lookahead pairs that do not exist in the from the set of k-length windows obtained from the training secondary normal dictionary is counted as the number of data set S. If S contains both normal and anomalous mismatches. The anomaly score of the window is calculated sequences, one way to obtain labeled windows was as the number of mismatches divided by total number of proposed in [10]: potential mismatches (¼ kðkÀ1Þ 2 ). 2. Comparing against a normal dictionary. Techniques . Windows extracted from normal sequences of S are under this category construct a normal dictionary (of fixed- assigned a normal label. length windows) from training sequences and compare . If a window extracted from an anomalous sequence each window from the test sequence to the normal of S occurs in any normal sequence also, it is dictionary to obtain an anomaly score. assigned a normal label, else it is assigned anom- Hofmeyr et al. [5] use Hamming Distance (or number of alous label. mismatches) between the test window !i and the closest If S contains only normal sequences, a random anomaly window in the normal dictionary as anomaly score Að!i Þ. In generation approach is used [9], [39], [42]. All windows

7.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 829 extracted from sequences in S are assigned a normal label. sequence is declared to be anomalous. This technique To obtain anomalous windows, random windows are provides the same benefit as LFC technique. generated. If the window occurs in S, it is ignored, or else Advantages and disadvantages of window-based tech- it is assigned an anomalous label and added to the niques. A key advantage of window-based techniques over training data set. similarity-based techniques is that they are better suited to After constructing a training data set, a classifier is detect anomalies which are localized in a short region in the learned. Different types of classification models have been test sequence. Constructing normal dictionaries is simple to learned, such as HMM-based [10], neural networks [9], implement and can be done efficiently using appropriate [36], [39], [50], SVM [51], [52], and rule-based classifiers data structures. [53]. During testing, the anomaly score for each One disadvantage of window-based techniques is that window !i obtained from test sequence t, Að!i Þ ¼ 0 if they are highly reliant on the value of k (length of the the classifier labels it as normal, and Að!i Þ ¼ 1 if the window). If k is chosen to be very small, most k-length classifier labels it as anomalous. windows will have a high probability of occurrence in the training sequences, while if k is chosen to be very large, 4.2.2 Obtaining Anomaly Score for Test Sequence most k-length windows will have a low probability of Once all windows in a given test sequence, t, are assigned occurrence in the training sequences. Thus, in either case, an anomaly score, the next step is to combine the anomaly the discriminatory power of the k-length windows to scores (a vector of length jtj À k þ 1) to obtain an overall differentiate between normal and anomalous sequences anomaly score AðtÞ. One such technique was discussed will be low. Setting an optimal value for k is challenging. earlier for the t-STIDE technique. Here, we will describe Another disadvantage is that storing all unique windows some other combination techniques that have been pro- that occur in training sequences and their frequencies can posed in the literature. require a large amount of memory. Hofmeyr et al. [5] propose two methods to compute the overall anomaly score. The first method computes this as 4.3 Markovian Techniques the average of the strength of anomaly signal over all The third category of techniques that solve sequence-based windows in t. formulation learn a model from the training sequences. The model is used as an approximation of the “true” distribu- 1 ltX Àkþ1 tion that generated the normal data. Typically, the prob- AðtÞ ¼ Að!i Þ; ð5Þ t i¼1 ability of a given sequence tð¼ ht1 ; t2 ; . . . tlt i) is factorized using the chain rule: where lt is the length of t. The second method checks if any window in t has a mismatch. Y lt P ðtÞ ¼ P ðti jt1 t2 . . . tiÀ1 Þ; ð7Þ  1; if 9 i : Að!i Þ ! 1; i¼1 AðtÞ ¼ ð6Þ 0; otherwise: where lt is the length of t. Forrest et al. [4] proposed a technique to obtain AðSq Þ, Markovian techniques utilize the short memory property known as locality frame count (LFC). For every mismatching of sequences, which is observed across different domains window Að!i Þ ¼ 1, the technique counts the number of [28]. The short memory property is essentially a higher mismatching windows in the previous n windows of t (n is order Markov condition which states that the conditional a user defined parameter). If this number is greater than a probability of occurrence of a symbol ti , given the sequence threshold, AðSq Þ is incremented. The LFC technique observed so far can be approximated as considers a sequence highly anomalous only when a P ðti jt1 t2 . . . tiÀ1 Þ ¼ P ðti jtiÀk tiÀkþ1 . . . tiÀ1 Þ; ð8Þ significant number of anomalous windows occur close to each other. If the anomalous windows are located far apart where k > 1. across t, the LFC method gives a lower anomaly score to t. Markovian techniques operate in two phases, training A similar technique has been proposed by Gao et al. [10]. and testing. Training involves learning the parameters of a The intuition behind the LFC approach is that anomalies probabilistic model of the training sequences and testing in actual anomalous sequences typically occur as tempo- involves computing the likelihood of the test sequence rally clustered anomalous symbols; hence, aggregating the given the parameters (see (7)). anomaly scores across the entire sequence t might “wash out” the anomaly signal; the local combination techniques 4.3.1 Fixed Markovian Techniques on the other hand would capture such behavior. Such techniques use a fixed history (of length k) to estimate A combination technique similar to LFC was proposed in the conditional probability of a symbol in the test sequence. [39] called the leaky bucket. They use the vector obtained of A basic fixed Markovian technique “learns” the condi- anomaly scores for the windows in the test sequence. For tional probability of occurrence of a given symbol ti as each anomalous window, 1 is added to the global anomaly fðtiÀk . . . tiÀ1 ti Þ score and for each normal window, 1 is subtracted (the P ðti jtiÀk . . . tiÀ1 Þ ¼ ; ð9Þ fðtiÀk . . . tiÀ1 Þ score is never allowed to fall below 0). Thus, consecutive anomalies result in the global score to increase. If at any where fðtiÀk . . . tiÀ1 ti Þ is the frequency of occurrence of the time the global score goes above a threshold, the entire test substring tiÀk . . . tiÀ1 ti in the sequences in S and

8.830 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 fðtiÀk . . . tiÀ1 Þ is the frequency of occurrence of the substring large to provide a reliable estimate of the conditional tiÀk . . . tiÀ1 in the sequences in S. probability of a symbol that follows this substring. For Different variants of the basic technique have been example, let us assume that the substring aabbb occurs once proposed. Ye [54] proposed one such technique for k ¼ 1. across all training sequences and is followed by symbol b For this special case, the conditional probability of occur- in that solitary occurrence. A fixed Markovian technique rence of given symbol ti can be written as (k ¼ 5) will assign a conditional probability of 1 to P ðbjaabbbÞ. But this conditional probability is not reliable, fðtiÀ1 ti Þ P ðti jtiÀ1 Þ ¼ : ð10Þ and hence might give undesirable results. Variable fðtiÀ1 Þ Markovian techniques try to address this issue by allowing An issue with the basic fixed Markovian technique is that symbols to be conditioned on a variable length history. For storing all transition frequencies (see (9)) can potentially the above example, P ðbjaabbbÞ might be substituted with require a huge amount of space. For an alphabet set size of P ðbjabbbÞ if the context abbb is more optimal to a given , the total number of frequencies required by the basic pruning criterion, e.g., frequency of abbb is greater than a technique will be ¼ ð À 1Þk . certain threshold. Models such as Probabilisitic Suffix Trees Michael and Ghosh [6] address this issue by storing the (PSTs) [28] and Interpolated Markov Models (IMM) can be frequencies in an Extended Finite State Automata (EFSA) [6]. used to efficiently compute the variable length conditional The EFSA is like a traditional FSA but with frequencies probabilities of a symbol. associated with the nodes and the transitions from one node One such technique was proposed by Sun et al. [13] to another. Each node denotes a k-length substring. A node using PSTs. A PST is a compact tree representation of a has a transition to another node only if the substring for the variable Markov chain, which uses classical suffix trees as its second node can be obtained from the substring for the first index structure. In a PST, each edge is labeled using a node by removing the first symbol of the first node and symbol, and each node represents the substring obtained by appending any symbol at the end of the substring. The traversing the path from root to the node, as well as the number of times the substring for a node occurs in the number of times the substring is observed in the training sequences in S is stored at the node. The number of times a sequences. Each node also stores the conditional probability transition from one node to another is observed in the of observing each symbol in the alphabet, given the sequences in S are stored at the transitions. Only those substring represented by the node. The PST is grown nodes and transitions that are observed in the training (training phase) by scanning the training sequences. The sequences are stored in the EFSA. Thus, the size of EFSA is maximum depth of the tree is fixed at k, which is a user typically smaller than the total possible size. defined parameter. Several pruning criteria are applied to During testing, ðk þ 1Þ sized windows are extracted from the PST to ensure that the PST contains only those paths the test sequence t. The first k symbols determine the that occur significantly enough number of times in the current state in the EFSA while the last k symbols denote the training sequences. The pruning can be done by applying next state in the EFSA. If a transition between the two states thresholds to the frequency of a node label, or to the is defined, the conditional probability for the last symbol of conditional probability of a symbol emanating from a given the window is computed by using the frequencies stored in node. If no pruning is applied, the PST is equivalent to the the current node and the next node (see (9)). If the transition EFSA technique discussed in Section 4.3.1. is not defined, the symbol is ignored. Chandola et al. [14] The PST-based technique computes the likelihood for the proposed a variant of the EFSA technique in which if the test sequence t with respect to the PST. Sun et al. [13] transition is not defined, the conditional probability of the investigate two likelihood-based measures, Odds measure last symbol of the window is set to 0. and Normalized measure. The Normalized measure is Michael and Ghosh [6] extended the EFSA-based computed as technique to compute the conditional probability for more ! than one symbol, given the previous k symbols, to obtain 1 X lt the final anomaly score for t. LðtÞ ¼ log P ðti jtiÀkþ1 . . . tiÀ1 Þ : ð11Þ lt i¼1 Another variant of the basic technique was proposed by Marceau [55] which uses suffix trees. In this technique, the The PST provides an efficient data structure to compute the suffix tree only maintains the ðk þ 1Þ length substrings and conditional probability, P ðti jtiÀkþ1 . . . tiÀ1 Þ ¼ P ðti jtiÀjþ1 . . . their k-length suffixes that exist in the sequences in S. Thus, tiÀ1 Þ, where ðj kÞ, such that tiÀjþ1 . . . tiÀ1 is the longest this technique learns a traditional FSA. During testing, all suffix of tiÀkþ1 . . . tiÀ1 which occurs as a path in the PST. ðk þ 1Þ length substrings are extracted from t and fed into The Odds measure is computed as the FSA. An anomaly is detected if the FSA reaches a state Qlt from where there is no outgoing edge corresponding to the i¼1 P ðti Þ last symbol of the current substring. LðtÞ ¼ Qlt ; ð12Þ i¼1 pðti jtiÀkþ1 . . . tiÀ1 Þ 4.3.2 Variable Markovian Techniques where pðti Þ is the empirical probability of symbol ti in S An issue with fixed Markovian techniques is that they and j k. force each symbol of a test sequence to be conditioned on Sun et al. [13] have shown that the Normalized measure in the previous k symbols (see (9)). Often, the frequency of a (11) performs better than the Odds measure in (12) for k-length substring, i.e., ðtiÀk . . . tiÀ1 Þ, may not be sufficiently anomaly detection using protein sequences.

9.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 831 4.3.3 Sparse Markovian Techniques is significantly low, but the probability of observing that Variable Markovian techniques described above allow a symbol after one or more k2 length substrings is sufficiently symbol ti to be analyzed with respect to a history that could high. In that case, the fixed Markovian techniques with be of different lengths for different symbols; but they still history k will assign a high anomaly score to any choose contiguous and immediately preceding symbols to occurrence of that symbol in a test sequence, thereby ti 2 t. Sparse Markovian techniques are more flexible in the increasing the false positive rate. On the other hand, the sense that they estimate the conditional probability of ti variable and sparse techniques will condition that symbol based on symbols within the previous k symbols, which are with respect to a k2 length history, and hence will assign it a not necessarily contiguous or immediately preceding to ti . low anomaly score in a test sequence. In other words, the symbols are conditioned on a sparse The above-mentioned strength of variable and sparse history. Using the example from Section 4.3.2, if the Markovian techniques can also become a disadvantage. For sequence “aabbb” occurs rarely in the training sequence, these techniques, the probability of a “truly” anomalous the conditional probability P ðbjaabbbÞ can be potentially symbol will be boosted since it will be conditioned on a shorter history, whereas the fixed Markovian technique replaced with P ðbjaXbXbÞ where X can be replaced with will still assign a low probability to such a symbol. Thus, any symbol of the alphabet. the variable and sparse techniques might suffer with One such sparse technique has been proposed by Eskin higher false negative rate. Chandola et al. [14] compare a et al. [17], using Sparse Markov Transducers (SMTs) [56]. fixed Markovian technique with a variable and a sparse SMTs, similar to probabilistic suffix trees, estimate a Markovian technique on data from different application probability distribution conditioned on an input sequence. domains and show that the fixed Markovian technique SMTs generalize probabilistic suffix trees by allowing for (using EFSA) outperforms the variable (using PST) and the wild cards in the conditioning sequences. The proposed sparse (using RIPPER) techniques on many data sets. The technique estimates the probability distribution of an same paper also provides scenarios in which the variable output symbol, conditioned on an input sequence. SMTs and sparse Markovian techniques can be better than the are sparse in the sense that they allow some positions in the fixed Markovian techniques. conditioning input sequence to be ignored by introducing wild cards. In the training phase, a forest of SMTs is learned 4.4 Hidden Markov Models-Based Techniques from the training sequences to account for wild cards at Hidden Markov Models are powerful finite state machines different positions in the paths from the root. The depth of that are widely used for sequence modeling [58]. An HMM the SMTs is fixed at a user defined depth k. The testing is parameterized by a hidden state transition matrix and an phase is exactly like the testing phase for the PST technique observation matrix. The three key learning tasks associated described earlier. with HMMs are: 1) for a given set of observation Lee et al. [7] proposed a different sparse Markovian sequences, learn the most likely HMM parameters which technique that uses a classification algorithm (RIPPER [57]) result in maximum likelihood for the observation se- to build sparse models for sequences. In this technique, quences, 2) for a given HMM, compute the hidden state k-length windows are extracted from the training sequences sequence that is most likely to have generated a given test in S. The first ðk À 1Þ positions of these windows are treated sequence, and 3) for a given HMM, with given state as ðk À 1Þ categorical attributes, and the kth position is transition and observation matrices, compute the prob- treated as the target class. RIPPER is used to learn rules ability of a given test sequence. Thus, HMMs can be used from this categorical data set to predict the kth symbol not only to estimate the probability of a test sequence given the first ðk À 1Þ symbols. During testing, for each (task 3) but also to transform an observation sequence into symbol ti , the first rule that fires for the vector tiÀkþ1 . . . tiÀ1 a hidden state sequence (task 2). Both of these capabilities is chosen. If the target of the selected rule is ti , then the of HMMs can be used to solve the anomaly detection probability P ðti jtiÀkþ1 . . . tiÀ1 Þ ¼ 1. If the target of the problem for discrete sequences. selected rule is not ti , then P ðti jtiÀkþ1 . . . tiÀ1 Þ is the inverse One approach to use HMMs for anomaly detection is to of the confidence associated with the selected rule. Since the first learn an HMM that best describes the normal training precedent of a RIPPER rule is an instance of the subset of sequences (task 1), and then compute the probability of each the ðk À 1Þ attributes, the RIPPER-based technique learns a test sequence using the learned HMM. The negative log of sparse model from the training data. the probability can be used as the anomaly score for the test Advantages and disadvantages of Markovian techni- sequence. The learning step is typically done using the ques. The key advantage of Markovian techniques is that Baum Welch algorithm [59], while the probability computa- they analyze each event with respect to its immediate tion is done using the forward algorithm. This approach has context. Thus, such techniques are able to detect anom- been used by Lane [44], [60] to identify anomalous alous sequences even if the anomalies are localized within computer usage in system call data corresponding to user a long sequence. The variable and sparse Markovian behavior. Yamanishi and Maruyama use a similar approach techniques provide flexibility in terms of the size of context for identifying anomalies in system log data [61], the only (or the length of the history, k), with respect to which a difference being that they use a mixture of HMMs to model symbol is analyzed. The advantage of such flexibility can the normal sequences. be illustrated using following example. Let us assume that Another approach to using HMMs for anomaly detection in a database of normal sequences, the probability of is to analyze the most likely or optimal hidden state observing a particular symbol after any k-length substring sequence (task 2), which can be obtained using the Viterbi

10.832 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 algorithm [62]. Florez et al. [63] label each state transition in (or substrings). To avoid confusion, we will use the term the optimal state sequence as normal or anomalous by discord to denote “contiguous subsequences” even though applying a threshold on the state transition probability. The the original papers may use the term subsequence only. number of anomalous state transitions for a test sequence is The contiguous subsequence-based problem formulation used as its anomaly score. Alternatively, Forrest et al. [4] is natural for several application domains, where an activity compute the average state transition probability for the is being monitored over a long period of time. For example, entire test sequence and use the average value to compute in credit card fraud detection, an individual’s credit card the anomaly score for the test sequence. transactions are continuously monitored, and a discord, i.e., The computation of the optimal hidden state sequence an anomalous sequence of actions (purchases), may indicate from an observation sequence can also be viewed as a data a case of identify theft/misuse. transformation step. Once all training and test observation A basic anomaly detection technique can be described as sequences are converted to corresponding state sequences, follows: first, all k-length windows are extracted from the any sequence anomaly detection technique that has been given sequence t and stored as a database of fixed-length discussed in previous sections can be applied to estimate windows, denoted as T k . Each window is assigned an the anomaly score for the test sequence. Qiao et al. [64] anomaly score by comparing it with rest of the windows in apply a window-based technique in the hidden state space. T k , e.g., computing average distance of the window to the An extension to the latter technique was proposed by windows in T k . The windows with anomaly scores above a Zhang et al. [65], in which the state transition sequences user defined threshold are chosen as the top discords. Since from the HMM are used to train another HMM. The state the length of the “true” discord is not known a priori, it is transition sequences from the second HMM are then used generally assumed that the discords are contained within a as input to a window-based anomaly detection technique. contiguous subsequence (or a window) of fixed length k. Advantages and disadvantages of HMM-based techni- This basic technique is at the heart of a number of ques. Key strength of HMM-based techniques is that they techniques investigated by Keogh et al. [66], [67], [68]. Note can model complex systems. Even when the normal that these techniques were originally presented in the observation sequences appear significantly different from context of time series data, but can be extended to discrete each other, the corresponding hidden state sequences will sequences by discretizing the time series using techniques be, in principle, more similar to each other and different such as Symbolic ApproXimation (SAX) [69]. from the anomalous sequences. Forrest et al. [4] compared An issue with the basic technique is that when any HMMs against window-based and Markovian techniques window is compared with the other windows in T k , the for system call sequences and concluded that the perfor- windows which overlap with the given window in t, will mance of HMM was comparable, and sometimes better be highly “similar” to the given window. For example, than the other techniques. any window will differ in at most one position with the The key assumption for HMM-based techniques is that immediately next window in the given sequence. This the normal sequences are generated from a probabilistic property will be exhibited by both normal windows as model, i.e., the HMM. If this assumption does not hold true well as windows that contain discords, and can bias the or the parameters are not estimated accurately (due to computation of anomaly score for a window. Keogh et al. suboptimal initializing conditions), the HMM-based tech- [67] propose a simple solution (referred to as nonself match nique will not be able to effectively distinguish between in the paper) to this issue by comparing a window with normal and anomalous sequences. The computational only the windows in T k that do not overlap with the complexity associated with HMM-based learning is another given window. drawback of such techniques, as noted by Forrest et al. [4]. Several variants of the basic techniques have been proposed, and can be broadly grouped into two categories. The first category of techniques scores the windows 5 CONTIGUOUS SUBSEQUENCE-BASED ANOMALY differently, while the second category of techniques DETECTION addresses the time complexity of the basic technique. Techniques under this category solve the following anom- 5.1 Window Scoring Techniques aly detection problem: One possible technique for scoring the windows is to count Definition 3. Detect short contiguous subsequences in a long the number of times a window occurs in the database of all sequence t, that are anomalous with respect to rest of t. windows, T k (this is similar to the window-based techni- ques such as t-STIDE, discussed in Section 4.2). The Techniques in this category try to identify anomalous anomaly score of the window will be the inverse of this contiguous subsequences (or substrings) of a long se- count. While for smaller values of k this is a reasonable quence, though the formulation can also be applied to technique, it might not be possible to find exact matches of detecting noncontiguous subsequences. The anomalous the window in T k when the value of k is large. contiguous subsequences are also defined as discords by In another possible technique, the anomaly score of a Keogh et al. [66]: discords are subsequences of a longer time window is calculated as equal to its distance to its series that are maximally different from the rest of the time series mth nearest neighbor in T k . One such variant, called subsequences. Note that, many papers in this category [66] HOTSAX [67], was proposed by Keogh et al., in which use the term “subsequence” (which can have gaps) even m ¼ 1, i.e., the anomaly score of a window is equal to its though they primarily deal with contiguous subsequences distance to its nearest neighbor in T k (only the nonself

11.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 833 matches are considered). One drawback of the nearest window will never figure in the top p anomalous windows, neighbor-based technique is that they involve an additional and hence can be pruned. parameter, m, which needs to be set carefully, though Since a vast majority of windows tend to be normal, the approaches such as using the weighted sum of distance to technique has the potential for substantial pruning, espe- the m nearest neighbors to compute the anomaly score can cially if the anomalous windows are discovered early. be used to reduce the sensitivity on m. Such pruning has been successfully used for traditional Another variation of the basic anomaly detection anomaly detection [75], and has been applied to discord technique, known as Window Comparison Anomaly Detection detection [67], [68], [72]. It should be noted that this pruning (WCAD), was proposed by Keogh et al. [70]. To assign an method guarantees the same result as the basic technique, anomaly score to each window in the sequence t, the but can result in lower execution time. window is compared against rest of the sequence (say t0 ) 5.3 Segmentation Techniques using an information theoretic measure called Compression- Choosing an optimal value of k is challenging. If k is set to Based Dissimilarity (CDM). For a window !i 2 T k , the CDM be very small, all k-length windows might appear highly is defined as probable, resulting in high false negative rates. If k is set to Cð!i t0 Þ be very large, all k-length windows might have a low CDMð!i ; t0 Þ ¼ ; ð13Þ occurrence probabilities, resulting high false positive rates. Cð!i Þ þ Cðt0 Þ This challenge becomes more significant if the sequence where CðxÞ is the compression attained by any standard contains multiple discords, of varying length. In this case, a compression algorithm on a given string x. The intuition single value of k might not be enough to detect all discords. behind using the measure defined in (13) is that if the One approach to address this challenge is to extract window !i is normal, it will match the rest of the unequal length segments (or substrings) from the given sequence very well and hence will not require extra space sequence t, as proposed by Chakrabarti et al. [21]. when both !i and t0 are compressed together. On the other Originally, this segmentation-based technique was pro- hand, if !i is a discord, it will not match the rest of the posed for a sequence of market baskets, but this can be sequence and hence the value of Cð!i t0 Þ, and hence the easily generalized to a discrete sequence, by encoding the anomaly score, will be high. alphabets as bit vectors, and hence treating them as market Wei et al. [71] proposed a variation of the basic technique baskets. In this approach, T is segmented into unequal by sliding two adjacent windows along the sequence. The length segments, such that the sum of the number of bits anomaly score of each window is computed by comparing required to encode each segment (using Shannon’s Source its bitmap with the bitmap of the previously adjacent Coding Theorem) is minimized. The segments which require window. The length of the preceding window is not highest number of bits for encoding are treated as discords. It is shown that an optimal Oðl2t Þ solution exists to find such required to be same as the current window. The underlying segmentation when the number of items is 1, i.e., for a assumption for this technique is that a normal window will binary sequence. Approximate algorithms are proposed to be similar to the previously adjacent window, while a handle the case when the number of items is more than 1, window containing a discord will be significantly different. though the technique is limited to small item set sizes, or Several of the above-mentioned variants measure simi- small alphabets. Gwadera et al. [76] proposed a variable larity/distance between a pair of windows. A number of Markov chain (similar to the variable Markovian model similarity/distance measures such as Simple Matching discussed in Section 4.3.2) based segmentation technique for Coefficient, edit distance, length of longest common subsequence, discrete sequences which can also be employed in a similar weighted SMC, chaos bitmaps [71] can be used. manner to identify discords. 5.2 Optimizing the Complexity of the Basic 5.4 Relationship between Sequence-Based and Technique Contiguous Contiguous Subsequence-Based An issue with the basic technique that solves contiguous Anomaly Detection Techniques subsequence-based formulation is that it requires Oðl2t Þ Techniques that handle sequence-based formulation and comparisons of window pairs, where lt is the length of the the techniques that handle contiguous subsequence-based sequence t. Several faster variations have been proposed formulation have been developed independently since they that can run in approximately linear time in the context of solve distinct problems. Here, we will discuss how continuous time series [67], [68], [72], [73], [74], and can be techniques belonging to the first category can be used to extended to discrete sequences. solve the second problem, and vice versa. One general technique for reducing complexity of the basic technique makes use of the following fact. Instead of 5.4.1 Using Sequence-Based Anomaly Detection scoring all windows, they score as many windows as Techniques to Detect Contiguous Contiguous required to get the top p anomalous windows. Thus, a Subsequence-Based Anomalies window can be pruned if at anytime its distance to its The basic contiguous subsequence-based anomaly detection current mth nearest neighbor is lower than the anomaly technique assigns an anomaly score to a given window !i score of the current window with pth largest anomaly score. with respect to a database of windows Tk (a majority of Since the distance of the window to its actual mth nearest which are assumed to be normal). Any sequence-based neighbor is upper bounded by the current distance, this anomaly detection technique, described in Section 4, can be

12.834 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 applied here by considering Tk À !i as the training data set normal sequences. This formulation is also related to case S and the given window !i as a test sequence Sq . For versus control analysis [78], used in epidemiological studies. Markovian techniques, such as EFSA and PST, extracting The idea is to detect patterns whose frequency of occur- windows from T is not explicitly required, since a model rence in a given test data set (case) is different from its can be constructed using the single long sequence T , and occurrence in a normal data set (control). Such techniques then a window can be tested against the model to obtain an aim to prioritize the patterns ( ) occurring in a known anomaly score. anomalous sequence (t) based on their frequency in the The key issue with using techniques, such as EFSA and sequences in S. PST, is that a model has to be constructed for every window !i , using the data set Tk À !i . A simple solution to this issue 6.1 A Basic Technique to Solve Pattern Frequency-Based Formulation is to construct a single model using the entire sequence Tk , and then score each window against this model, but the A basic technique to solve the above problem assigns an model will get biased from the discords that exist in the anomaly score to the query pattern, , as the difference original sequence T . However, this bias will be small as between the frequency of occurrence of in the sequence t long as the size of the discord is small compared to the and the average frequency of occurrence of in the entire sequence. Unsupervised techniques such as cluster- sequences in set S. ing and k-nearest neighbor can also be applied to assign an More formally, the following two quantities may be anomaly score to each window, though the self match issue, defined, ft ð Þ and fS ð Þ: discussed earlier, needs to be explicitly handled. ft ð Þ, also called the relative frequency of the query pattern, is the frequency with which the query pattern 5.4.2 Using Contiguous Contiguous occurs in the test sequence t, normalized by the length of t, Subsequence-Based Anomaly Detection and can be directly computed as Techniques to Detect Sequence-Based Anomalies ft ð Þ If all training sequences in S and the given test sequence ft ð Þ ¼ ; ð14Þ lt Sq in sequence-based formulation are arranged linearly, in no particular order, a long sequence can be constructed. where ft ð Þ is the frequency of the query pattern in the test An anomaly detection technique discussed to handle sequence. fS ð Þ is the average frequency of the query contiguous contiguous subsequence-based formulation pattern to occur in a sequence in S normalized by the length can be applied to detect all anomalous fixed-length of the sequence. fS ð Þ can be estimated as windows. The anomaly score of the test sequence Sq can 1 X fsi ð Þ be assigned as equal to the number of windows of Sq that fS ð Þ ¼ : ð15Þ jSj 8s 2S jlsi j are detected as discords. The key issue with this adapta- i tion is that the entire test sequence will be compared (as a The anomaly score of the query pattern is computed as window) with other training sequences, and hence will face same challenges as the similarity-based techniques Að Þ ¼ jfS ð Þ À fS ð Þj: ð16Þ discussed in Section 4.1. Another issue with this approach is that the windows which span across two sequences are 6.2 Variations of the Basic Technique an artifact of the concatenation and might affect the In the basic technique, a query pattern is considered to performance of the technique. “occur” in a sequence if is a substring of the sequence. An issue with considering substrings is that it forces the query pattern to occur exactly. If is long, it is unlikely for entire 6 PATTERN FREQUENCY-BASED ANOMALY to occur in a training sequence. Another issue is that in DETECTION many domains, it is reasonable to assume that the symbols Techniques under this category solve the following anom- of the query pattern can occur interspersed with other aly detection problem: symbols, and hence only considering substring matches will miss such occurrences. To address these issues, following Definition 4. Given a short query pattern , a long test sequence three variations of the basic technique have been proposed: t, and a training set of long sequences S, determine if the frequency of occurrence of in t is anomalous with respect to . Counting the number of times asubstringof the query frequency of occurrence of in S. pattern occurs in a sequence. Keogh et al. [77] find the largest l in the interval ½1; l Þ, such that every l length This problem has also been referred to as surprise substring of occurs at least once in the training detection in the context of time series data [69], [77]. Keogh sequences. et al. [77] define a pattern to be surprising “if the frequency of The frequency fS ð Þ from (15) is then replaced the pattern differs substantially from that expected by chance, with the following quantity: given some previously seen data.” Note that patterns with both QmÀl greater or smaller frequencies are of interest. j¼1 fS ð j:jþl Þ The pattern frequency-based formulation is motivated fS ð Þ ) QmÀl ; j¼2 fS ð j:jþlÀ1 Þ from Scenario 3 discussed in Section 3 where a pattern is anomalous if its frequency in the given sequence is where j:k denotes the substring of starting at significantly different from its expected frequency in jth location and ending at the kth location.

13.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 835 The idea is to estimate the occurrence of the query are multiple query patterns to be scored, a relative pattern using a set of substrings of . By searching ordering can be obtained using the anomaly scores, and for a set in which all substrings occur at least once, top few patterns with highest scores could be declared as the frequency estimation is more reliable. anomalous. But if there is only one query pattern, a . Counting the number of times the query pattern occurs method is needed for declaring the query pattern to be as a noncontiguous subsequence in a sequence. An issue anomalous or normal, based on its anomaly score. with the basic technique is that counting the number Gwadera et al. [79] address this issue by assuming that of times occurs as a subsequence in a long the relative frequency, ft ð Þ, is generated from a normal sequence is expensive. To make the formulation distribution $N ðE½f S ð ފ; V ar½f S ð ފÞ. The expected value more tractable, Gwadera et al. [79] extract all or mean, E½f S ð ފ, and the variance, V ar½f S ð ފ, of the windows of a fixed length (greater than the length distribution are estimated using sequences in S. The of ), from the sequence and then determine all anomaly score for the query pattern is computed as the those windows that contain as a subsequence. z-score of the observed relative frequency: Thus, a query pattern is considered to “occur” in a sequence if there exists a window such that is a f t ð Þ À E½f S ð ފ zscore ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð17Þ subsequence of the given window. The number of V ar½f S ð ފ fixed-length windows which contain as a subse- quence is counted. These counts are used as ft ð Þ A threshold on the zscore is used to determine if the and fsi ð Þ in (14) and (15), respectively. occurrence of is anomalous or not. . Counting the number of times any permutation of the query pattern occurs as a noncontiguous subsequence in 6.4 Relationship between Pattern Frequency-Based a sequence. This alternative was proposed by Gwadera and Sequence-Based Anomaly Detection et al. [80] as an extension to the subsequence Techniques matching technique [79]. For the permutation ap- Pattern frequency-based anomaly detection techniques proach, the number of fixed-length windows which assign an anomaly score to a short pattern , while the contain any permutation of is counted. The basic window-based sequence anomaly detection technique motivation behind considering all permutations is (see Section 4.2) assigns an anomaly score to each short that in certain scenarios, the ordering of events (or window belonging to the test sequence. Both techniques symbols) within the query pattern does not matter. appear to be similar in this step, but they are actually quite distinct. While the basic technique discussed in this section 6.3 Issues with the Basic Technique and Techniques for Addressing Them assigns an anomaly score to a pattern based on its frequency in the test sequence and its average frequency in normal The basic technique and its variations have following two sequences, the basic window-based technique assigns an key issues: anomaly score to a window based on its similarity to the 6.3.1 Computational Complexity windows in the normal sequences. To be more precise, the basic window-based technique for sequence-based anomaly For each query pattern, the time required to compute its detection uses fS ð Þ as the anomaly score of each window, anomaly score is linear in the length of t and the length and while the basic technique for pattern frequency-based number of sequences in S. If there are multiple query anomaly detection compares this value with the relative patterns, e.g., all short contiguous subsequences that occur frequency of occurrence of the window in the test sequence in t, the total time required to score all query patterns adds to compute its anomaly score (see (16)). up to a high value. To address this issue, Keogh et al. Nevertheless, techniques for pattern frequency-based proposed a technique called TARZAN [69], [77] which uses formulation can also be used to find certain types of suffix trees to efficiently compute the frequency of occur- anomalies in the context of sequence-based anomaly rence of a query pattern in a given sequence. Suffix trees are detection. An alternate cause of an anomaly could be that created for t and for each sequence in S.2 Only two suffix a test sequence is anomalous because it contains one or trees are required, one for the sequences in S and for the more patterns whose frequency of occurrence in the test sequence t, and can be constructed with complexity linear sequence is significantly different from their frequency of in the length of the sequences. The counts for a query occurrence in the training sequences. Such anomalous pattern , can be obtained with complexity linear in the sequences can be detected as follows: for the given test length of . Gwadera et al. [81] use Interpolated Markov sequence, fixed-length windows are extracted. Each win- Models to efficiently find the number of windows extracted dow is assigned an anomaly score using the basic technique from a sequence that contain the query pattern. that deals with the pattern frequency-based formulation. 6.3.2 Scoring of Anomalies The anomaly scores of all windows are aggregated to obtain an overall anomaly score for the test sequence. The basic technique assigns an anomaly score to the query A direction adaptation of contiguous subsequence-based pattern (15) but does not declare if the query pattern is anomaly detection techniques to solve the pattern fre- anomalous or not. For the TARZAN technique, since there quency-based formulation, and vice versa, is not feasible, 2. TARZAN was originally proposed to handle the case when S contains since contiguous subsequence-based formulation deals only one sequence. with only one sequence while pattern frequency-based

14.836 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 formulation requires a training set of sequences, a test formulations are not exhaustive and the anomaly detection sequence, and a query pattern. problem might be formulated in other ways also, though most of the existing work can be covered under the three formulations discussed here. 7 ONLINE ANOMALY DETECTION For each problem formulation, there are distinct groups Several application domains collect sequence data in a of techniques that use a specific approach to detect streaming fashion [82], e.g., system call data collected by a anomalies. Within each group, we have identified a basic computer system, data generated by an aircraft during its technique and shown how different existing techniques are flight, etc. Such domains often require the anomalies to be variations of the corresponding basic technique. This results detected in such sequences in an online fashion, i.e., as soon in a better understanding of the current state of research as as they occur [83], [84]. Online anomaly detection has the well as allows future researchers to develop novel varia- advantage that it can allow analysts to undertake pre- tions. For example, in the similarity-based techniques for ventive or corrective measures as soon as the anomaly is sequence-based formulation (Section 4.1), using a different manifested in the sequence data. combination of a point-based anomaly detection algorithm For example, in aircraft health monitoring, the current and a similarity measure, a different anomaly detection flight sequence for an aircraft is tested if it is anomalous or technique can be developed. Similarly, by using different techniques to compare a pair of windows, one can develop not, with respect to a database of historical normal flight novel variations of the basic window-based technique sequences of the aircraft. Determining that the current flight discussed in Section 5.1. In addition, techniques from one sequence has an anomaly, as soon as it occurs (even before problem formulation can be adapted to solve a different the entire flight sequence is collected) might help the health formulation, thereby enriching the set of techniques to monitoring system to raise an early alarm. choose from, for a given formulation. Among the different categories of techniques that handle Although the focus of this survey paper has been on the sequence-based formulation, some can be easily discrete sequences, many of the techniques discussed here adapted to operate in an online fashion. The window-based are also applicable to continuous sequences (or time series). and Markovian techniques assign anomaly score to win- For example, similarity-based techniques can be adapted for dows (or symbols) as they are collected. By applying a continuous sequences by using an appropriate similarity/ threshold on the anomaly score, the sequence can be distance measure, such as euclidean Distance [86], [87] and declared to be anomalous even before observing the entire Cross Correlation [88], [89]. Window-based techniques can be sequence. Hence, such techniques can be easily adapted to adapted by using the euclidean distance to kth closest operate in an online fashion. For example, Larrahondo et al. window from the training sequence set to assign anomaly [85] have proposed an online HMM-based anomaly detec- score to the window [87]. Markovian techniques can be tion technique which can compute the optimal state adapted by substituting the Markovian model with a time sequence for a given observation sequence in an online series prediction model, which can determine the likelihood fashion by modifying the forward backward algorithm. In of observing a real valued event, given a finite history of contrast, similarity-based techniques measure the similarity events [90]. HMM-based techniques can be adapted by of entire test sequence with training sequences, and hence using a continuous version of HMM [91]. While some of are not suitable for online detection problem. these adaptations have already been proposed, others are Techniques for the subsequence-based formulation can subject of future research. also be easily adapted to detect anomalies in an online This paper has focused on techniques that deal with univariate discrete sequences. Many real-world applica- fashion. In the online setting, each successive subsequence tions deal with sequences of multivariate events, where the of symbols could be assigned an anomaly score with respect events might contain discrete [92], continuous [93], or a to the sequence observed so far. A threshold on the score mixed set of observations [94]. Some of the techniques computed for each window can be used to declare it to be discussed in this paper are relevant to such settings, though normal or anomalous, as soon as it is observed. To obtain additional thought still needs to be given on how to handle reliable anomaly scores, such techniques will require such complex sequences. For example, similarity and observation of a significantly long portion of the sequence window-based techniques for sequence-based formulation, initially before scoring the incoming subsequences. as well as the basic technique for subsequence-based The pattern frequency-based formulation is more diffi- problem formulation can be easily extended to multivariate cult to handle in an online setting, in which the test sequences as long as a similarity measure can be developed sequence S is collected in an online fashion. The reason is to compare two multivariate discrete sequences. Recent that the frequency of occurrence of the query pattern in the work in the area of ranking patterns in sequences of item test sequence will have to be estimated without observing sets [95] can be extended to identify subsequence-based the entire test sequence. anomalies (low frequency patterns) in multivariate discrete sequences. Similarly, the recently proposed multievent pattern discovery [92] can be employed to identify pattern 8 CONCLUDING REMARKS frequency-based anomalies. One of the most interesting aspect of the problem of While the literature on anomaly detection for discrete anomaly detection for sequences is the rich diversity of sequences is rich, there are several research directions that problem formulations. In this survey, we have discussed need to be explored in the future. In this survey, we have three different problem formulations that are relevant in provided a high level comparison of strengths and limita- varied application domains. We note that these three tions of various techniques. An experimental study of these

15.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 837 techniques is essential for a more in-depth understanding of [18] T. Lane and C.E. Brodley, “Temporal Sequence Learning and Data Reduction for Anomaly Detection,” ACM Trans. Information their characteristics. Adapting existing solutions to other Systems and Security, vol. 2, no. 3, pp. 295-331, 1999. related problems, such as online anomaly detection and [19] G. Liu, T.K. McDaniel, S. Falkow, and S. Karlin, “Sequence handling multivariate sequences, is also an important Anomalies in the cag7 Gene of the Helicobacter Pylori Pathogeni- city Island,” Proc. Nat’l Academy of Sciences USA, vol. 96, no. 12, direction for future research. pp. 7011-7016, 1999. [20] A.N. Srivastava, “Discovering System Health Anomalies Using Data Mining Techniques,” Proc. Joint Army Navy NASA Airforce ACKNOWLEDGMENTS Conf. Propulsion, 2005. This work was supported by NASA under award [21] S. Chakrabarti, S. Sarawagi, and B. Dom, “Mining Surprising Patterns Using Temporal Description Length,” Proc. 24th Int’l NNX08AC36A and US National Science Foundation (NSF) Conf. Very Large Data Bases, pp. 606-617, 1998. Grants CNS-0551551 and IIS-0713227. Access to computing [22] D. Pavlov and D. Pennock, “A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional facilities was provided by the Digital Technology Con- Domains,” Proc. Advances in Neural Information Processing Systems, sortium. Varun Chandola was with the Department of 2002. Computer Science and Engineering, University of Minne- [23] D. Pavlov, “Sequence Modeling with Mixtures of Conditional Maximum Entropy Distributions,” Proc. Third IEEE Int’l Conf. Data sota, during the time of writing this paper. Mining, pp. 251-258, 2003. [24] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM SIGMOD Int’l REFERENCES Conf. Management of Data, pp. 427-438, 2000. [1] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection - A [25] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 1-58, July Geometric Framework for Unsupervised Anomaly Detection,” 2009. Applications of Data Mining in Computer Security, pp. 78-100, [2] V. Hodge and J. Austin, “A Survey of Outlier Detection Kluwer Academics, 2002. Methodologies,” Artificial Intelligence Rev., vol. 22, no. 2, pp. 85- [26] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice- 126, 2004. Hall, Inc., 1988. [3] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A [27] J. Yang and W. Wang, “CLUSEQ: Efficient and Effective Sequence Comparative Study of Anomaly Detection Schemes in Network Clustering,” Proc. Int’l Conf. Data Eng., pp. 101-112, 2003. Intrusion Detection,” Proc. SIAM Int’l Conf. Data Mining, May [28] D. Ron, Y. Singer, and N. Tishby, “The Power of Amnesia: 2003. Learning Probabilistic Automata with Variable Memory Length,” [4] S. Forrest, C. Warrender, and B. Pearlmutter, “Detecting Intru- Machine Learning, vol. 25, nos. 2/3, pp. 117-149, 1996. sions Using System Calls: Alternate Data Models,” Proc. IEEE [29] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, Symp. Security and Privacy (ISRSP), pp. 133-145, 1999. “Visualization of Navigation Patterns on a Web Site Using Model- [5] S.A. Hofmeyr, S. Forrest, and A. Somayaji, “Intrusion Detection Based Clustering,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Using Sequences of System Calls,” J. Computer Security, vol. 6, Discovery and Data Mining, pp. 280-284, 2000. no. 3, pp. 151-180,, 1998. [30] P. Smyth, “Clustering Sequences with Hidden Markov Models,” [6] C.C. Michael and A. Ghosh, “Two State-Based Approaches to Proc. Advances in Neural Information Processing Systems, vol. 9, 1997. Program-Based Anomaly Detection,” Proc. 16th Ann. Computer [31] R.R. Sokal and C.D. Michener, “A Statistical Method for Security Applications Conf., p. 21, 2000. Evaluating Systematic Relationships,” Univ. of Kansas Scientific [7] W. Lee, S. Stolfo, and P. Chan, “Learning Patterns from Unix Bull., vol. 38, pp. 1409-1438, 1958. Process Execution Traces for Intrusion Detection,” Proc. AAAI 97 [32] J.W. Hunt and T.G. Szymanski, “A Fast Algorithm for Computing Workshop AI Methods in Fraud and Risk Management, 1997. Longest Common Subsequences,” Comm. ACM, vol. 20, no. 5, [8] W. Lee and S. Stolfo, “Data Mining Approaches for Intrusion pp. 350-353, 1977. Detection,” Proc. Seventh USENIX Security Symp., 1998. [33] N. Kumar, V.N. Lolla, E.J. Keogh, S. Lonardi, and C.A. [9] F.A. Gonzalez and D. Dasgupta, “Anomaly Detection Using Real- Ratanamahatana, “Time-Series Bitmaps: A Practical Visualization Valued Negative Selection,” Genetic Programming and Evolvable Tool for Working with Large Time Series Databases,” Proc. SIAM Machines, vol. 4, no. 4, pp. 383-403, 2003. Int’l Conf. Data Mining (SDM), 2005. [10] B. Gao, H.-Y. Ma, and Y.-H. Yang, “Hmms (Hidden Markov [34] T. Lane and C.E. Brodley, “Sequence Matching and Learning in Models) Based on Anomaly Intrusion Detection Method,” Proc. Anomaly Detection for Computer Security,” Proc. AI Approaches to Int’l Conf. Machine Learning and Cybernetics, pp. 381-385, 2002. Fraud Detection and Risk Management, Fawcett, Haimowitz, [11] S. Budalakoti, A. Srivastava, R. Akella, and E. Turkov, “Anomaly Provost, and Stolfo, eds., pp. 43-49, 1997. Detection in Large Sets of High-Dimensional Symbol Sequences,” [35] M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, “Lof: Technical Report NASA TM-2006-214553, NASA Ames Research Identifying Density-Based Local Outliers,” Proc. ACM SIGMOD Center, 2006. Int’l Conf. Management of Data, pp. 93-104, 2000. [12] S. Budalakoti, A. Srivastava, and M. Otey, “Anomaly Detection [36] D. Endler, “Intrusion Detection: Applying Machine Learning to and Diagnosis Algorithms for Discrete Symbol Sequences with Solaris Audit Data,” Proc. 14th Ann. Computer Security Applications Applications to Airline Safety,” Proc. IEEE Int’l Conf. Systems, Man, Conf., pp. 268-279, 1998. and Cybernetics, vol. 37, no. 6, 2007. [13] P. Sun, S. Chawla, and B. Arunasalam, “Mining for Outliers in [37] H. Debar, M. Dacier, M. Nassehi, and A. Wespi, “Fixed vs. Sequential Databases,” Proc. SIAM Int’l Conf. Data Mining, 2006. Variable-Length Patterns for Detecting Suspicious Process Beha- [14] V. Chandola, V. Mithal, and V. Kumar, “A Comparative vior,” Proc. Fifth European Symp. Research in Computer Security, Evaluation of Anomaly Detection Techniques for Sequence Data,” pp. 1-15, 1998. Proc. Int’l Conf. Data Mining, 2008. [38] A.K. Ghosh, A. Schwartzbard, and M. Schatz, “Using Program [15] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Behavior Profiles for Intrusion Detection,” Proc. SANS Third Conf. Science and Computational Biology. Cambridge Univ. Press, 1997. and Workshop Intrusion Detection and Response, [16] S. Forrest, S.A. Hofmeyr, A. Somayaji, and T.A. Longstaff, “A edu/ghosh99learning.html, Feb. 1999. Sense of Self for Unix Processes,” Proc. IEEE Symp. Security [39] A. Ghosh, A. Schwartzbard, and M. Schatz, “Learning Program and Privacy (ISRSP ’96), pp. 120-128, Behavior Profiles for Intrusion Detection,” Proc. First USENIX forrest96sense.html, 1996. Workshop Intrusion Detection and Network Monitoring, pp. 51-62, [17] E. Eskin, W. Lee, and S. Stolfo, “Modeling System Call for Apr. 1999. Intrusion Detection Using Dynamic Window Sizes,” Proc. DARPA [40] J.B.D. Cabrera, L. Lewis, and R.K. Mehra, “Detection and Information Survivability Conf. and Exposition (DISCEX), citeseer. Classification of Intrusions and Faults Using Sequences of System, 2001. Calls,” SIGMOD Record, vol. 30, no. 4, pp. 25-34, 2001.

16.838 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 [41] A.P. Kosoresow and S.A. Hofmeyr, “Intrusion Detection via [66] E. Keogh, J. Lin, S.-H. Lee, and H.V. Herle, “Finding the Most System Call Traces,” IEEE Software, vol. 14, no. 5, pp. 35-42, Unusual Time Series Subsequence: Algorithms and Applications,” Sept./Oct. 1997. Knowledge and Information Systems, vol. 11, no. 1, pp. 1-27, 2006. [42] D. Dasgupta and F. Nino, “A Comparison of Negative and Positive [67] E. Keogh, J. Lin, and A. Fu, “Hot SAX: Efficiently Finding the Most Selection Algorithms in Novel Pattern Detection,” Proc. IEEE Int’l Unusual Time Series Subsequence,” Proc. Fifth IEEE Int’l Conf. Conf. Systems, Man, and Cybernetics, vol. 1, pp. 125-130, 2000. Data Mining, pp. 226-233, 2005. [43] T. Lane and C.E. Brodley, “An Application of Machine Learning [68] J. Lin, E. Keogh, A. Fu, and H.V. Herle, “Approximations to to Anomaly Detection,” Proc. 20th Nat’l Information Systems Magic: Finding Unusual Medical Time Series,” Proc. 18th IEEE Security Conf., pp. 366-380, 1997. Symp. Computer-Based Medical Systems, pp. 329-334, 2005. [44] T. Lane, “Machine Learning Techniques for the Computer [69] J. Lin, E. Keogh, L. Wei, and S. Lonardi, “Experiencing SAX: A Security Domain of Anomaly Detection,” PhD dissertation, Novel Symbolic Representation of Time Series,” Data Mining and Purdue Univ., 2000. Knowledge Discovery, vol. 15, no. 2, pp. 107-144, 2007. [45] D. Dasgupta and N. Majumdar, “Anomaly Detection in Multi- [70] E. Keogh, S. Lonardi, and C.A. Ratanamahatana, “Towards dimensional Data Using Negative Selection Algorithm,” Proc. Parameter-Free Data Mining,” Proc. 10th ACM SIGKDD Int’l Conf. IEEE Conf. Evolutionary Computation, pp. 1039-1044, May 2002. Knowledge Discovery and Data Mining, pp. 206-215, 2004. [46] S. Forrest, P. D’haeseleer, and P. Helman, “An Immunological [71] L. Wei, N. Kumar, V. Lolla, E.J. Keogh, S. Lonardi, and C. Approach to Change Detection: Algorithms, Analysis and Implica- Ratanamahatana, “Assumption-Free Anomaly Detection in Time tions,” Proc. IEEE Symp. Security and Privacy, pp. 110-119, 1996. Series,” Proc. 17th Int’l Conf. Scientific and Statistical Database [47] S. Forrest, A.S. Perelson, L. Allen, and R. Cherukuri, “Self-Nonself Management, pp. 237-240, 2005. Discrimination in a Computer,” Proc. IEEE Symp. Security and [72] L. Wei, E. Keogh, and X. Xi, “Saxually Explicit Images: Finding Privacy, pp. 202-212, 1994. Unusual Shapes,” Proc. Sixth Int’l Conf. Data Mining, pp. 711-720, [48] S. Forrest and D. Dasgupta, “Novelty Detection in Time Series 2006. Data Using Ideas from Immunology,” Proc. Fifth Int’l Conf. [73] Y. Bu, T.-W. Leung, A. Fu, E. Keogh, J. Pei, and S. Meshkin, “Wat: Intelligence Systems, 1996. Finding Top-k Discords in Time Series Database,” Proc. Seventh [49] S. Forrest, F. Esponda, and P. Helman, “A Formal Framework for SIAM Int’l Conf. Data Mining, 2007. Positive and Negative Detection Schemes,” IEEE Trans. Systems, [74] A.W.-C. Fu, O.T.-W. Leung, E.J. Keogh, and J. Lin, “Finding Time Man and Cybernetics, Part B, vol. 34, no. 1, pp. 357-373, Feb. 2004. Series Discords Based on Haar Transform,” Proc. Second Int’l Conf. [50] A.K. Ghosh, J. Wanken, and F. Charron, “Detecting Anomalous Advanced Data Mining and Applications, pp. 31-41, 2006. and Unknown Intrusions against Programs,” Proc. 14th Ann. [75] A. Ghoting, S. Parthasarathy, and M.E. Otey, “Fast Mining of Computer Security Applications Conf., pp. 259-267, 1998. Distance-Based Outliers in High-Dimensional Datasets,” Proc. [51] M. Wang, C. Zhang, and J. Yu, “Native Api Based Windows SIAM Data Mining Conf., 2006. Anomaly Intrusion Detection Method Using SVM,” Proc. IEEE [76] R. Gwadera, A. Gionis, and H. Mannila, “Optimal Segmentation Int’l Conf. Sensor Networks, Ubiquitous, and Trustworthy Computing, Using Tree Models,” ICDM ’06: Proc. Sixth Int’l Conf. Data Mining, vol. 1, pp. 514-519, 2006. pp. 244-253, 2006. [52] S. Tian, S. Mu, and C. Yin, “Sequence-Similarity Kernels for Svms [77] E. Keogh, S. Lonardi, and B.Y.C. Chiu, “Finding Surprising to Detect Anomalies in System Calls,” Neurocomputing, vol. 70, Patterns in a Time Series Database in Linear Time and Space,” nos. 4-6, pp. 859-866, 2007. Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data [53] X. Li, J. Han, S. Kim, and H. Gonzalez, “Roam: Rule- and Motif- Mining, pp. 550-556, 2002. Based Anomaly Detection in Massive Moving Object Data Sets,” [78] J.J. Schlesselman, Case-Control Studies: Design, Conduct, Analysis Proc. Seventh SIAM Int’l Conf. Data Mining, 2007. (Monographs in Epidemiology and Biostatistics). Oxford Univ. Press, [54] N. Ye, “A Markov Chain Model of Temporal Behavior for 1982. Anomaly Detection,” Proc. Fifth Ann. IEEE Information Assurance [79] R. Gwadera, M. Atallah, and W. Szpankowski, “Reliable Detection Workshop, 2004. of Episodes in Event Sequences,” Knowledge and Information [55] C. Marceau, “Characterizing the Behavior of a Program Using Systems, vol. 7, no. 4, pp. 415-437, 2005. Multiple-Length N-Grams,” Proc. Workshop New Security Para- [80] R. Gwadera, M. Atallah, and W. Szpankowskii, “Detection of digms, pp. 101-110, 2000. Significant Sets of Episodes in Event Sequences,” Proc. Fourth IEEE [56] E. Eskin, W.N. Grundy, and Y. Singer, “Protein Family Classifica- Int’l Conf. Data Mining, pp. 3-10, 2004. tion Using Sparse Markov Transducers,” Proc. Int’l Conf. Intelligent [81] R. Gwadera, M.J. Atallah, and W. Szpankowski, “Markov Models Systems for Molecular Biology (ISMB ’08), pp. 134-145, 2000. for Identification of Significant Episodes,” Proc. Fifth SIAM Int’l [57] W.W. Cohen, “Fast Effective Rule Induction,” Proc. 12th Int’l Conf. Data Mining, 2005. Conf. Machine Learning, A. Prieditis and S. Russell, eds., pp. 115- [82] R.A. Maxion and K.M.C. Tan, “Benchmarking Anomaly-Based 123, July 1995. Detection Systems,” Proc. Int’l Conf. Dependable Systems and [58] L.R. Rabiner and B.H. Juang, “An Introduction to Hidden Markov Networks, pp. 623-630, 2000. Models,” IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, Jan. 1986. [83] A. Pawling, P. Yan, J. Candia, T. Schoenharl, and G. Madey, [59] L.E. Baum, T. Petrie, G. Soules, and N. Weiss, “A Maximization “Anomaly Detection in Streaming Sensor Data,” Intelligent Technique Occuring in the Statistical Analysis of Probabilistic Techniques for Warehousing and Mining Sensor Network Data, IGI Functions of Markov Chains,” Annals of Math. Statistics, vol. 41, Global, 2008. no. 1, pp. 164-171, 1970. [84] D. Pokrajac, A. Lazarevic, and L.J. Latecki, “Incremental Local [60] T. Lane, “Hidden Markov Models for Human/Computer Inter- Outlier Detection for Data Streams,” Proc. IEEE Symp. Computa- face Modeling,” Proc. IJCAI-99 Workshop Learning about Users, tional Intelligence and Data Mining, 2007. pp. 35-44, 1999. [85] G. Florez-Larrahondo, S.M. Bridges, and R. Vaughn, “Efficient [61] K. Yamanishi and Y. Maruyama, “Dynamic Syslog Mining for Modeling of Discrete Events for Anomaly Detection Using Network Failure Monitoring,” KDD ’05: Proc. 11th ACM SIGKDD Hidden Markov Models,” Information Security, vol. 3650, pp. 506- Int’l Conf. Knowledge Discovery in Data Mining, pp. 499-508, 2005. 514, 2005. [62] J. Forney, G.D., “The Viterbi Algorithm,” Proc. IEEE, vol. 61, no. 3, [86] G.K. Palshikar, “Distance-Based Outliers in Sequences,” Proc. pp. 268-278, Mar. 1973. Second Int’l Conf. Distributed Computing and Internet Technology, [63] G. Florez, Z. Liu, S. Bridges, A. Skjellum, and R. Vaughn, pp. 547-552, 2005. “Lightweight Monitoring of Mpi Programs in Real Time,” [87] D. Yankov, E.J. Keogh, and U. Rebbapragada, “Disk Aware Concurrency and Computation: Practice and Experience, vol. 17, Discord Discovery: Finding Unusual Time Series in Terabyte no. 13, pp. 1547-1578, 2005. Sized Datasets,” Proc. Int’l Conf. Data Mining, pp. 381-390, 2007. [64] Y. Qiao, X.W. Xin, Y. Bin, and S. Ge, “Anomaly Intrusion [88] P. Protopapas, J.M. Giammarco, L. Faccioli, M.F. Struble, R. Dave, Detection Method Based on Hmm,” Electronics Letters, vol. 38, and C. Alcock, “Finding Outlier Light Curves in Catalogues of no. 13, pp. 663-664, 2002. Periodic Variable Stars,” Monthly Notices of the Royal Astronomical [65] X. Zhang, P. Fan, and Z. Zhu, “A New Anomaly Detection Soc., vol. 369, no. 2, pp. 677-696, 2006. Method Based on Hierarchical Hmm,” Proc. Fourth Int’l Conf. [89] U. Rebbapragada, P. Protopapas, C.E. Brodley, and C. Alcock, Parallel and Distributed Computing, Applications and Technologies, “Finding Anomalous Periodic Time Series,” Machine Learning, pp. 249-252, 2003. vol. 74, pp. 281-313, 2009.

17.CHANDOLA ET AL.: ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY 839 [90] J. Ma and S. Perkins, “Online Novelty Detection on Temporal Vipin Kumar received the BE degree in Sequences,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge electronics and communication engineering Discovery and Data Mining, pp. 613-618, 2003. from Indian Institute of Technology Roorkee [91] Z. Liu, J.X. Yu, L. Chen, and D. Wu, “Detection of Shape (formerly, University of Roorkee), India, in Anomalies: A Probabilistic Approach Using Hidden Markov 1977, the ME degree in electronics engineering Models,” Proc. IEEE 24th Int’l Conf. Data Eng., pp. 1325-1327, from Philips International Institute, Eindhoven, Apr. 2008. Netherlands, in 1979, and the PhD degree in [92] R. Gwadera and F. Crestani, “Discovering Significant Patterns in computer science from the University of Mary- Multi-Stream Sequences,” Proc. Eighth IEEE Int’l Conf. Data land, College Park, in 1982. He is currently Mining, pp. 827-832, 2008. William Norris professor and head of the [93] H. Cheng, P.-N. Tan, C. Potter, and S. Klooster, “Detection and Computer Science and Engineering Department at the University of Characterization of Anomalies in Multivariate Time Series,” Proc. Minnesota. His current research interests include data mining, high- Ninth SIAM Int’l Conf. Data Mining, 2009. performance computing, and their applications in Climate/Ecosystems [94] R. Fujimaki, T. Nakata, H. Tsukahara, and A. Sato, “Mining and Biomedical domains. He has authored more than 250 research Abnormal Patterns from Heterogeneous Time-Series with Irrele- articles, and has coedited or coauthored 11 books including widely vant Features for Fault Event Detection,” Proc. SIAM Int’l Conf. used text books Introduction to Parallel Computing and Introduction to Data Mining, pp. 472-482, 2008. Data Mining. He is a founding co-editor-in-chief of Journal of [95] R. Gwadera and F. Crestani, “Ranking Sequential Patterns with Statistical Analysis, a cofounder of SIAM International Conference Respect to Significance,” Proc. 14th Pacific-Asia Conf. Knowledge on Data Mining, and editor of Data Mining and Knowledge Discovery Discovery and Data Mining, (PAKDD ’[10), pp. 286-299, 2010. Book Series published by CRC Press/Chapman Hall. He received the 2009 Distinguished Alumnus Award from the Computer Science Varun Chandola received the PhD degree from Department, University of Maryland, College Park, and 2005 IEEE the University of Minnesota, Twin Cities, in 2009. Computer Society’s Technical Achievement Award for contributions to He is a postdoctoral research associate in the the design and analysis of parallel algorithms, graph partitioning, and Geographic Information Science and Technol- data mining. He is a fellow of the ACM, IEEE, and AAAS. ogy group at Oak Ridge National Labs. His areas of expertise include data mining, time series data analysis, machine learning, and . For more information on this or any other computing topic, algorithm development. His research focus is please visit our Digital Library at in the area of anomaly detection applied to discrete sequences and time series data. He has applied his research, with significant success, to varying application domains, such as biomass monitoring, aviation safety, cyber intrusion detection, tax fraud analysis, cardiac health monitoring, and click fraud analysis. Arindam Banerjee received the PhD degree from the University of Texas at Austin in 2005. He is an assistant professor and a McKnight Land grant professor in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities. His re- search interests are in data mining and machine learning, and their applications to real-world problems. His work currently focuses on statis- tical and graphical models for learning and predictive modeling with large-scale data. His research interests also include information theory and convex analysis, and applications in complex real-world learning problems including problems in text and web mining, bioinformatics, and social network analysis. He is a member of the IEEE.