# Event Regularity and Irregularity in a Time Unity

In this paper, we study the problem of learning a regular model from a number of sequences, each of which contains events in a time unit. Assuming some regularity in such sequences, we determine what events should be deemed irregular in their contexts. We perform an in-depth analysis of the model we build, and propose two optimization techniques, one of which is also of independent interest in solving a new problem named the Group Counting problem. Our comprehensive experiment son real and hybrid data sets show that the model we build is very effective in characterizing regularities and identifying irregular events. One of our optimizations improves model building speed by more than an order of magnitude, and the other significantly saves space consumption.

1. Event Regularity and Irregularity in a Time Unit Lijian Wan Tingjian Ge University of Massachusetts, Lowell University of Massachusetts, Lowell lwan@cs.uml.edu ge@cs.uml.edu Abstract—In this paper, we study the problem of learning going up and down with the selected altitude) would have been a regular model from a number of sequences, each of which automatically detected, and the tragedy in the subsequent flight contains events in a time unit. Assuming some regularity in such could have been prevented. sequences, we determine what events should be deemed irregular in their contexts. We perform an in-depth analysis of the model Example 2: Elderly people (or persons with autism) who we build, and propose two optimization techniques, one of which live alone often lead a dangerous life. If they fall down on is also of independent interest in solving a new problem named the floor, or pass out in the bathroom, no one may notice it the Group Counting problem. Our comprehensive experiments for a long time. It would be useful to continuously sense the on real and hybrid datasets show that the model we build is very location, movement, and other signals from the person, and effective in characterizing regularities and identifying irregular have a software program automatically monitor the situation. events. One of our optimizations improves model building speed by more than an order of magnitude, and the other significantly These people’s normal daily activities are usually simple, upon saves space consumption. which a regular model may be built (the time unit here is one day) and abnormal events may be detected as soon as they happen. Many lives could be saved from such continuous I. I NTRODUCTION automatic monitoring. We live in “interesting times”. For computing, this is true Example 3: Consider a person who lives a normal life in that we are inundated with data, and computing power is with established and habitual daily activities, such as going to ever increasing. Much of this data has to do with ubiquitous office for work, having meetings with some group of people, computing — data given by sensors that detect various at- dining in a certain set of restaurants, entertaining at a number tributes of the state of the world, e.g., GPS (for locations), of places, and so on. Variations are possible, say, between temperature and light sensors, accelerometers and gyroscopes workdays and weekends. Automatic recognizing and logging (for movements and orientations), among many others that “irregular” events are often of interest. For instance, if a even come with smartphones today. Continuous sensing creates meeting takes place with someone who I do not usually interact relational sequence data where one of the attributes is time, with or at a place I do not typically visit, it may be useful to along with any number of other attributes. Each tuple in the have this logged. Or if anything significant happens, I may look sequence describes the state at an instant of time, which often back to see if any recent irregular events might have caused implies the event, activity, or action that is occurring. Certain this outcome. events and activities often have some patterns within a time unit (typically a day or the duration of a process), and the Event detection and activity recognition have been well- same pattern is repeated periodically over many time units. studied in the ubiquitous computing and artificial intelligence It is useful to find out the model of such regularity, and to communities (e.g., ). In the above motivating applications, identify irregular or unusual events when they happen. Let us each tuple at a time instant may have an event/activity label look at some examples. indicating the event or action that is happening. Each time unit — e.g., a flight in the pilot example or a day in the latter Example 1: On March 24, 2015, the copilot Lubitz crashed two examples — has a sequence of tuples, and we may have Germanwings Flight 4U9525 in the French Alps, killing all a number of such sequences, from which a regular model is 150 people on board. Data shows that in the flight into to be learned. Using this regular model, irregular events are Barcelona prior to the doomed flight, Lubitz practiced a further identified. Outlier/anomaly detection has been studied controlled descent, where he toyed with the plane’s settings, for sequence data (cf. a survey ), temporal data (cf. a survey programming it for sharp descent multiple times . The ), and in more general contexts (cf. a survey ). However, “selected altitude” of the flight changed repeatedly, including previous work is designed for different problems and is very several times being set as low as 100 feet above the ground. different from ours. We discuss it in detail in Sec. VIII (related Lubitz also put the engines on idle, which gives the plane the work) and Sec. VII (experiments). ability to quickly descend. It is highly unusual for a pilot to do these. The airline company technically could have known about Our Contributions. We begin with formally stating the Lubitz’ apparent rehearsal on this previous flight, but only if it problem, and proposing to cluster events into an event hi- had looked at the flight data in the short period while the plane erarchy tree. This abstracts events into higher level com- was unloading and loading passengers in Barcelona. There monalities, which we call generalized events, to build into are various sensors in a modern airplane to record operations the regular model. In this way, even though specific events and settings of the pilot. Suppose a “regular” model for a may differ, variations can be permitted, and the higher level single flight (which is the time unit mentioned above) had common characteristics of the events are recognized in the been built, irregular and unusual events during this flight (i.e., model. Then we first study how to build a regular model. 978-1-5090-2020-1/16/\$31.00 © 2016 IEEE 930 ICDE 2016 Conference

2.Our algorithm performs message passing progressively and builds the regular model at the same time. This framework is easily parallelizable. The resulting model characterizes events and their contexts. Next we devise a method that returns the regularity score of each event in a sequence, based on the model. Our experiments show that our model is very effective in identifying irregular events. We further perform an in-depth analysis of property of the model graph. We adaptively estimate the probability of creating a new edge into the model graph during the model building Fig. 1: Illustrating the event hierarchy tree created by Cobweb. process, based on a novel two function recursive approach, also incorporating runtime information obtained from actual data. This results in an optimization that we can stop the model building algorithm early when the new edge probability Given S, the first aspect of our problem is to obtain a is very small. Our experiments show that this optimization model M that characterizes the regularity of the sequences in is very effective. The performance improvement is dramatic, S. Most, if not all, sequences in S share this same model. often over an order of magnitude. Moreover, there is virtually We call it a regular model. The second aspect of our problem no loss in accuracy. The precision is 1, and the recall is 0.99 is to return a function R, such that given the regular model or higher for edges in the model graph. M and any sequence s (regardless of whether s ∈ S), one Another optimization is to reduce memory consumption. can obtain the regularity score of any event in s, denoted as We first abstract out a general problem called the Group R(s, j, M) for the jth event in s. Moreover, one can obtain the Counting Problem, which is of independent interest. It is a regularity score of the sequence s, denoted as R(s, M). Such generalization of the commonly known set membership prob- regularity scores are between 0 and 1, with 1 being the most lem in computer science. We devise a low memory footprint regular and 0 the least. Note that, in the problem statement, approximate solution named blended bitmap set (BBS). Our we intentionally do not restrict the form of the regular model experiments show that with BBS our regular model building or the score function, in order to make the possible solutions uses significantly less memory (e.g., one eighth) without losing flexible. much accuracy and model power. Finally, we carry out a comprehensive experimental study to examine our regular B. Preprocessing Events model properties, effectiveness in finding irregular events, and For building a regular model, we propose to preprocess the efficiency of our algorithms. Our contributions can be the events in the data into an event hierarchy tree T. The summarized as follows: idea is that a (new) generalized higher-level event incorporates • We identify and formulate an interesting problem with multiple more specific events, hence allowing us to model practical significance (Sec. II). variations of the same general event. For example, the events • We propose an algorithmic framework based on mes- “drinking coffee” and “drinking tea” can be generalized to a sage passing and event hierarchies to build a regular higher level event “drinking beverage”, even though “drinking model, which can be easily parallelized (Sec. III). beverage” is not an event in the original sequence set S (while • We propose a method that returns the regularity of “drinking coffee” and “drinking tea” are). Similarly, suppose all events in a sequence, which is shown to be very a person sometimes has lunch at restaurant A, sometimes at effective in experiments (Sec. IV). restaurant B. Then the events “lunch at restaurant A” and • We perform two optimizations: one is to estimate “lunch at restaurant B” can be generalized into a higher level a statistic dynamically at runtime and can stop the event “lunch at a restaurant”, or even further, just “having execution early, while the other is a new compact ap- lunch”. All original events in S appear as leaves of the event proximate data structure to significantly save memory tree T, while internal nodes of T are higher-level events. Thus, consumption (Sec. V, VI). if different sequences in S show variations of a general event, • We perform a systematic experimental study to eval- their commonality can still be modeled, knowing the event uate our approaches (Sec. VII). hierarchy T. An event in that context that does not belong to the more general event can be an irregular event. II. P ROBLEM S TATEMENT AND P REPROCESSING We merely propose the general strategy of clustering events into an event hierarchy T. There can be multiple ways to A. Problem Statement implement this strategy. In fact, if nothing is done, all the We are given a set of event sequences S = {s1 , s2 , . . . , sm } original events in S can be deemed a “flat” tree, as leaves under as input, where each sequence si (1 ≤ i ≤ m) consists of a a single root node. Hence, our subsequent algorithm framework list of events within the duration of a time unit. Thus, the time is always applicable to any event data. Such clustering is units are repeated, and each time unit is a time interval, such as based on the attributes in an event. We use a technique called a day or the duration of a flight. The whole set S spans m time conceptual clustering  in machine learning. Given a set units. The events in si are denoted as si ,si , . . . Note that of events (or object descriptions in general), it produces a si ’s may be of different lengths. Each event si [j] is a tuple classification scheme over the events based on their attributes. with a number of attributes including a timestamp attribute, Specifically, we use the Cobweb algorithm provided as part which is monotonically increasing within each sequence si . of the commonly used Weka machine learning tools . Fig. 931

3.1 shows a snippet of the event tree produced by Cobweb on Algorithm 1: B UILD R EGULAR M ODEL (S, T) one of our test datasets. After this conceptual clustering, each Input: S: a set of m event sequences, sequence si in S corresponds to a sequence of leaf nodes of T: an event hierarchy tree T. Each of these leaf nodes e belongs to more general event Output: a regular model M types that are e’s ancestors in T. 1 G ←(V, E) where V is nodes of T and E is to be determined 2 set of agents A ← leaves of T 3 for each s ∈ S in parallel do III. B UILDING A R EGULAR M ODEL 4 for each event s[i] ∈ s in parallel do 5 agent a[i] initiates a message and sends to a[i + 1] In this section, we propose an approach to solve the first aspect of the problem stated in Section II-A, namely to build 6 R(u, v) ← 0, ∀u, v ∈ V, u = v a regular model. This takes as input a set of event sequences 7 R(v, v) ← 1, ∀v ∈ V S and the event hierarchy tree T discussed in Section II-B. 8 for each time step t do Intuitively, the general idea is that an event e is “regular” if it 9 for each agent a in parallel do “fits in” its “context”, in terms of what other events are before 10 if a receives a message msg then and after it. If this context is common for e to be in, then e is 11 a0 ← msg’s initiator regular. However, one has to keep in mind that the relevant 12 for each v1 ∈ ancestors (a0 ), v2 ∈ ancestors(a) do surrounding events in the context may not be immediately 13 if R(v1 , v2 ) = 1 then preceding or succeeding e, as some random events could occur 14 continue in between. 15 set the bit in bm(v1 , v2 ) for a’s sequence Our algorithm is based on “message passing”. Each event 16 if |bm(v1 , v2 )| > τ · m then (in each sequence of S) initiates a message to its right neighbor 17 E ← E (v1 , v2 ) event. These messages propagate one step at a time to the 18 l ← max(l(v1 ), l(v2 )) receiver’s right neighbor. During this progressive propagation, 19 for each x ∈ V do 20 for each y ∈ V do each event pair (message initiator, receiver) is added to a 21 if max(l(x), l(y)) ≥ l and counter maintained for each event pair. When the counter for 22 R(x, v1 ) = R(v2 , y) = 1 then an event pair is above a threshold, this event pair is added 23 R(x, y) ← 1 as an edge in a graph, which we will eventually return as the regular model. The event pairs we keep track of include general event types as well, i.e., the internal nodes of T. Thus, the model tolerates variations of events belonging to a general 24 for each s ∈ S in parallel do type, although an edge over a more general event type carries 25 for each event s[i] ∈ s in parallel do 26 agent a[i] propagates its message to a[i + 1] less weight when we compute the “regularity score” in Section IV. 27 return G as regular model M Another point is that the regular model that we return is concise. During the message propagation, we do not add an edge from event e1 to event e2 to the model graph if there already exists a path from e1 to e2 in the model graph (which In lines 6-7, the binary matrix R(u, v) is to keep track of means each edge in this path takes fewer message passing “reachability” from vertex u to vertex v in the model graph G steps). Thus, if there is a common event sequence a → b → so far. More precisely, R(u, v) = 1 if there exists a path from u c, then we only have two edges a → b and b → c in the to v where every vertex on this path has a level no greater than model graph, but not a → c. Besides efficiency, another reason max(l(u), l(v)). This is exactly the condition under which we for doing this is that, in considering an event e’s “context”, do not create a new edge from u to v. Note that if there is a we would prefer events closer to e, even though we do allow vertex on this path with a level greater than those of both u interleaving irrelevant events. and v, it is still beneficial to add a direct edge from u to v, since it does not go through a more general event (which is a For a node v in T , we define the level of v, denoted as stronger relationship). Lines 6-7 are to initialize R such that a l(v), to be the maximum number of edges to traverse from vertex is only reachable to itself. v to a leaf that is a descendant of v (i.e., in v’s subtree). Thus, all leaves have a level of 0. We show the algorithm At each time step (line 8), the receiver agent receives in B UILD R EGULAR M ODEL. the message from the sender agent. We call it a time step here, which simply means the propagation of one hop of the In line 1, we initialize a model graph, where the vertices messages. Line 11 gets the original initiator of the message are just the nodes of T, and the edges are to be determined in (the event/agent) who creates the message in the beginning this algorithm. In line 2, we start an “agent” for each distinct (i.e., line 5). In line 12, we iterate through each ancestor of leaf event. These agents will be responsible for sending and a0 and each ancestor of a in the input event tree T, where a0 receiving messages. In lines 3-4, we use the keyword “in and a correspond to leaves of the tree. If there is a path from parallel” to emphasize the fact that each loop can be executed v1 to v2 (lines 13-14), then we do not bother to create a new independently in parallel. Indeed, our message passing algo- edge. Otherwise, in line 15, we “increment the counter” for rithmic framework is highly parallelizable. For event s[i] in this event pair v1 , v2 . However, this is not a simple counter, sequence s, we denote its agent as a[i]. In line 5, a[i + 1] is as each sequence in S can only be counted once. That is, we the agent for the event’s right neighbor. keep track of, out of the m sequences in total, how many have 932

5. Algorithm 2: C OMPUTE R EGULARITY (s, M) Input: s: an event sequence, M: regular event model Output: regularity r[i] of each event s[i], and regularity r of s 1 for each event s[i] ∈ s in parallel do 2 agent a[i] initiates a message and sends to a[i + 1] 3 initialize pre[i] and post[i] to 0 4 for each time step t do Fig. 2: Illustrating the impact of a missing event “e” from a 5 for each agent a[i] in parallel do sequence. (a) A regular model where missing e will only make events d and f irregular. (b) A regular model where missing e 6 if a[i] receives a message msg then will not have a significant impact on the regularity scores of 7 a[j] ← msg’s initiator other events in the sequence. 8 for each v0 ∈ ancestors (a[j]), 9 v ∈ ancestors(a[i]) do 10 if (v0 , v) ∈ M then 11 w ← γ l(v0 )+l(v) 12 if w > pre[i] then probability that there exists an edge to an event n time steps 13 pre[i] ← w later (i.e., an edge that spans n hops), and let qn denote the 14 if w > post[j] then probability that there exists a path to an event n time steps 15 post[j] ← w later. We devise a novel two function (pn and qn ) recursive approach to solve pn and qn simultaneously. 16 for each letter s[i] ∈ s in parallel do We further define fn as the probability that an event s[i] in 17 agent a[i] propagates its message to a[i + 1] S and an event s[i + n] is a frequent pair in S, i.e., the event order s[i] followed by s[i+n] appears in more than τ sequences 18 for each event s[i] ∈ s in parallel do in S. fn can be estimated as the empirical probability as 19 r ← pre[i] · post[i] B UILD R EGULAR M ODEL is run and messages are propagated |s| r[i] incrementally. 20 r← i=1 |s| We then have: n−1 metric is very effective in identifying irregular events within pn = fn · (1 − pi qn−i ) (1) a sequence, even though the sequence itself is overall regular i=1 (having a high regularity score). Note that an event at the very n beginning (or very end, resp.) of a sequence may have the qn = pi qn−i (2) pre[·] score (or post[·] score, resp.) being 0, which gives a i=1 low regularity score. However, we can quickly find such an event at the border of a sequence and only look at its post[·] In Equation (1), there is an edge to a vertex n hops away score (or pre[·] score, resp.) that truly indicates its regularity. (pn ) if and only if the pair is frequent (fn ) and there is no path n−1 with multiple edges between the two nodes (1− i=1 pi qn−i ). As another remark, consider the scenario that an event e The case of multiple-edge path is subdivided according to the is missing from a sequence. What impact does it have on the number of hops of the first edge in the path (i.e., pi ), and regularity scores of other events in this sequence? Informally, the remaining path (qn−i ). Similarly, in Equation (2), the case it depends on how “important” this event e is to other events of existing a path of n hops is subdivided according to the (with respect to the regular model M). If M is “linear” around number of hops of the first edge in the path (i.e., pi ). The e, like the one shown in Fig. 2(a), missing e would only affect boundary conditions are q0 = 1 and p1 = f1 . d’s post[·] score and f ’s pre[·] score — hence only d and f ’s regularity scores. If, on the other hand, M is like in Fig. 2(b), It is not hard to see that, by applying Equations (2) and where e’s incoming neighbor d has multiple outgoing edges, (1) alternately, we can iteratively obtain all qi ’s and pi ’s up to and e’s outgoing neighbor f has multiple incoming edges, qn and pn (i.e., in the order q1 , p2 , q2 , p3 , and so on). We can then the regularity scores of these other events would not be stop the process, as well as the algorithm, when pn is small significantly changed by the absence of e. enough (i.e., when n is big enough). A few remarks are in place. The above analysis serves V. A NALYSIS AND O PTIMIZATION both as understanding of the model graph property (i.e., how Intuitively, during the B UILD R EGULAR M ODEL algorithm, likely an edge crosses over n events in the sequence), and as as a message propagates a long way, the chance that it will a performance optimization in that we can stop the algorithm form new edges in G becomes very small. This is because it early when pn is small enough (i.e., the model is stable). is more likely to have a “path” between two farther nodes as Secondly, we choose to empirically measure fn and an- the number of possible paths increases. Hence, when the edge alytically derive pn because fn is much easier (in terms of probability becomes low enough, we may stop the lengthy accuracy) to be measured empirically (as a single frequency message propagation process. We now rigorously analyze this number for the whole set of sequences S). This is because idea. typically pn fn when n is large, as evidenced by Equation From an agent that initiates a message, let pn denote the (1) showing pn < fn and shown empirically. It is well 934

6.known that a very small probability number pn (when n is large) is extremely difficult to estimate empirically (requiring a very large sample size) . Furthermore, the frequency measurement fn is increasingly more accurate as time step progresses (n increases) since more event pairs have been seen so far as the basis for determining if the frequency threshold τ · m is reached. Our experiments in Section VII show that (1) pn converges quickly to close to 0 (Fig. 14), (2) the performance improve- ment with early stop is dramatic, often more than an order of magnitude (Fig. 15), and (3) there is hardly any loss on Fig. 3: Illustrating a BBS structure where k = 3, w = 5, and accuracy, with precision 1 and recall 0.99 or higher (Fig. 16). m = 4. That is, there are 3 hash functions (one for each row), The details are in Section VII. Thus, this optimization should each hash function have 5 possible values (thus 5 columns), always be performed. and there is a bitmap of 4 bits at each cell (for 4 groups). An empty cell indicates its bitmap is all 0’s now. The final remark is that this is a dynamic analysis and estimation on the fly based on statistics collected at execution time. It is impossible to do this statically, as the result depends on the actual data. each of the k hash functions hi to e and get the hash value be hi (e). We take log w bits from hi (e), which corresponds to a VI. F URTHER E NHANCEMENT column (say, column j) in the ith row. Then we set the group information (group g) in the bitmap of cell (i, j), denoted as Recall that, in the B UILD R EGULAR M ODEL algorithm, we bm[i, j]. However, we do not directly set bit g of bm[i, j]. need to maintain a bitmap for each ordered pair (u, v) in Instead, we apply a random permutation permi over the m bits, S, which may be very space consuming when the number where permi is also determined by hi (e) (by taking a few bits of events is large. Consider this general problem that is of from it). Suppose bit g maps to bit g with this permutation. independent interest: Then we set bit g of bm[i, j]. This is done for each of the k Group Counting Problem: There are a large num- hash functions (k rows). ber of elements, and a set of m groups. There The query/lookup function on a BBS also first applies each are continuous, possibly duplicate input pairs of of the k hash functions and locates cell (i, j) and bm[i, j] in (element, group) information. Use a succinct data the same way. Then it applies the reverse of the permutation structure to store this information. A query is of the perm−1i on bm[i, j] (for each 1 ≤ i ≤ k). This gets the following form: given an element, return an estimate original bitmaps back. Then we do a bitwise AND over the of how many groups this element belongs to, and k recovered bitmaps to estimate the group membership and possibly which groups. group count. The bit permutations in BBS operations are for balancing the collision probabilities at each bit position (robust When m = 1, this problem is exactly the commonly known set to group size skewness) to accurately estimate group counts, membership problem, and can be approximately solved by a which is detailed later. We show these two algorithms below. hash table, or in particular, a Bloom filter . Thus, the Group Counting problem is a generalization of the set membership problem. Algorithm 3: BBS-S ET E LEMENT G ROUP(bbs, e, g) The Group Counting problem is part of B UILD R EGULAR - Input: bbs: a blended bitmap set, e : element, M ODEL for counting the frequency of event pairs (lines 15-16 g : group id of B UILD R EGULAR M ODEL). Specifically, each distinct event Output: updated bbs, group size sum sum0 pair (v1 , v2 ) is an element, and each sequence in S is a group 1 for each i ← 1 . . . k do (hence there are m groups). Given an event pair element, we 2 ri ← hi (e) need to query how many groups (sequences) it is in (and 3 use bits from ri to determine column id j ∈ 1 . . . w compare this number with τ · m). Therefore, we discuss our 4 use bits from ri to determine a random permutation space-efficient approximate solution to the Group Counting permi problem. 5 g ← permi (g) 6 if bit g of the bitmap bm[i, j] in cell c[i, j] = 0 then We devise a novel approximation technique, which we 7 set that bit to 1 call a blended bitmap set (BBS), to solve this problem. A BBS consists of k rows and w columns of bitmaps, while 8 if line 7 is performed at least once then each bitmap has m bits. The choice of k and w is based on 9 sum0 ← sum0 + 1 memory constraint and will be discussed shortly. We choose k hash functions hi (1 ≤ i ≤ k) randomly from a universal hash function family . Each hi corresponds to a row of w BBS-S ET E LEMENT G ROUP is exactly as discussed above. bitmaps. Fig. 3 shows an example BBS structure. The ri in line 2 is the (random) hash value from the ith hash function. One detail we have not discussed is the permutation The basic idea is as follows. Whenever an element-group permi in line 4. A fully uniformly random permutation over pair (e, g) arrives (meaning element e is in group g), we apply m bits requires log m! bits to describe. This is too many bits to 935

7.take from ri . Moreover, doing permi and perm−1 i would be 0.14 rather costly. As mentioned earlier, the goal of permutation is 0.12 to have an accurate estimation of group count (detailed later). 0.1 We must balance between feasibility and accuracy. Thus, we 0.08 partition a bitmap into a constant number b of blocks, and pi 0.06 permute the blocks uniformly at random. This only requires 0.04 log b! bits from ri , and is much more efficient. There are standard algorithms to do the random permutation of b blocks, 0.02 such as the Knuth shuffle  with a cost O(b). 0 0 5 10 15 20 i Line 5 is to apply this permutation function to bit position g, and g is the position it is permuted to (along with its Fig. 4: The convergence of the pi series block). We set that bit (line 7). Lines 8-9 are to maintain a counter sum0 , which is the total (estimated) number of distinct elements currently in the BBS — it can be an under- Proof: Element e maps to k cells in BBS (one in each estimate due to collisions. This will be used in BBS-L OOK U P row) due to the hash functions. For this bit in gm (let it be for adjusting the group count estimate. bit x) to be a false positive, it must have been set to 1 in the bitmaps of all these k cells. Let us consider these k cells one at Algorithm 4: BBS-L OOK U P(bbs, e, sum0 ) a time, say cell (i, j). In order for bit x of bm[i, j] to be 1, there are two conditions: (1) with a random permutation determined Input: bbs: a blended bitmap set, e : element, by its hash value, another elements group ID landed on bit sum0 : current group size sum x; and (2) that element’s hi hash value happens to go to j. Output: bitmap of groups e is in, and estimated group count The probability of (2) is w1 , and due to random permutation, 1 gm ← bitmap of all 1’s // group membership of e in expectation, c distinct elements satisfy (1), where c is the 2 for each i ← 1 . . . k do average group size stated in the theorem, i.e., the average 3 ri ← hi (e) number of distinct elements in each group. Therefore, the 4 use bits from ri to determine column id j ∈ 1 . . . w probability that at least one of such collisions happens in cell 5 use bits from ri to determine a random permutation (i, j) is 1 − (1 − w1 )c . For bit x to be a false positive, this permi needs to be true for all k cells, giving the probability stated in 6 bm ← perm−1 i (bm[i, j]) the theorem. 7 gm ← gm&bm 8 c0 ← sum0 /m In line 8 of BBS-L OOK U P, c0 is an under-estimate of the 9 if c0 has changed significantly from previous call then average group size c due to collisions (false positives). 10 i ← 0; p0 ← 0 Theorem 2: The pi series in the loop at lines 10-14 will 11 repeat c0 converge. Moreover, the converged value p is an unbiased 12 pi+1 ← [1 − (1 − w1 ) 1−pi ]k estimate of the bit false positive probability in gm, and x−mp 1−p 13 i←i+1 is an unbiased estimate of the number of groups that e is in. 14 until pi converges; let the converged value be p 15 x ← number of 1’s in gm Proof: First, we claim that the pi series in line 12 16 return gm and x−mp monotonically increases. We prove this by induction on i. 1−p In the base case, when i = 0, clearly pi+1 > pi since p1 = [1−(1− w1 )c0 ]k > 0 = p0 . For the induction step, suppose Line 1 of BBS-L OOK U P initializes a bitmap gm to be all pi > pi−1 is true. Thenc we show that pi+1 > pi . This is true 0 1’s for its m bits. Lines 2-4 are as before to apply each of the k because [1 − (1 − w1 ) 1−x ]k increases when x increases from hash functions and get to cell (i, j) of BBS. Then line 6 applies pi−1 to pi . Since the pi series monotonically increases, and the reverse permutation to bm[i, j], i.e., the reverse operations since it has an upper bound 1, from the Monotone Convergence of the b block permutation for BBS-S ET E LEMENT G ROUP. Let c0 Theorem , it must converge. Furthermore, since 1−p is an the recovered bitmap be bm (line 6). Note that only the same unbiased estimate of c, the converged value p must also be element e will use the same permi as it is determined by bits an unbiased estimate of the bit false positive probability from of hi (e). Line 7 is the bitwise AND operation. Thus, after the Theorem 1. Finally, if the true total number of groups that e loop, gm ends up being the bitwise AND of the k original is in is y, then x = y + (m − y)p in expectation, which gives bitmaps (before permutation) corresponding to e. Clearly, the y = x−mp 1−p as an unbiased estimation. bits corresponding to all the groups that element e is in must be 1 in gm. Fig. 4 plots the convergence of the pi series with (arbitrary) parameters k = 3, w = 500, and c0 = 300. Finally, let us At this point, although gm contains the estimated group discuss the optimal choice of parameters k and w. Suppose membership of element e, the total group count can be an we are given a space budget. Since each cell of BBS contains overestimate due to collisions on other bits. a bitmap of the same size, it is equivalent to having an upper bound on k · w. Let this limit be k · w = M . Theorem 1: After the loop in lines 2-7 of BBS-L OOK U P, any bit in gm that should be 0 (for a group that e is not in) Theorem 3: Given the constraint k · w = M , the choice has an estimated false positive probability of [1 − (1 − w1 )c ]k , of k and w that minimizes the bit false positive probability in where c is the average group size. Theorem 1 is k = M c c ln 2 and w = ln 2 . 936