A Primitive Operator for Similarity Joins in Data Cleaning

Data cleaning based on similarities involves identification of “close” tuples, where closeness is evaluated usingavariety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real data sets,
展开查看详情

1. A Primitive Operator for Similarity Joins in Data Cleaning Surajit Chaudhuri Venkatesh Ganti Raghav Kaushik Microsoft Research {surajitc,vganti,skaushi}@microsoft.com Abstract values to join data across relations, e.g., similarities in part descriptions in the above example. A variety of string simi- Data cleaning based on similarities involves identifica- larity functions have been considered, such as edit distance, tion of “close” tuples, where closeness is evaluated using a jaccard similarity, cosine similarity and generalized edit dis- variety of similarity functions chosen to suit the domain and tance ([4]), for measuring similarities. However, no single application. Current approaches for efficiently implement- string similarity function is known to be the overall best ing such similarity joins are tightly tied to the chosen sim- similarity function, and the choice usually depends on the ilarity function. In this paper, we propose a new primitive application domain [10, 13] (see also Section 6). For exam- operator which can be used as a foundation to implement ple, the characteristics of an effective similarity function for similarity joins according to a variety of popular string sim- matching products based on their part names where the er- ilarity functions, and notions of similarity which go beyond rors are usually spelling errors would be different from those textual similarity. We then propose efficient implementations matching street addresses because even small differences in for this operator. In an experimental evaluation using real the street numbers such as “148th Ave” and “147th Ave” datasets, we show that the implementation of similarity joins are crucial, and the soundex function for matching person using our operator is comparable to, and often substantially names. better than, previous customized implementations for partic- The similarity join of two relations R and S both con- ular similarity functions. taining a column A is the join R θ S where the join pred- icate θ is f (R.A, S.A) > α, for a given similarity func- tion f and a threshold α. Although similarity joins may 1. Introduction be expressed in SQL by defining join predicates through user-defined functions (UDFs), the evaluation would be very Data cleaning is an essential step in populating and main- inefficient as database systems usually are forced to apply taining data warehouses and centralized data repositories. UDF-based join predicates only after performing a cross A very important data cleaning operation is that of “join- product. Consequently, specialized techniques have been ing” similar data. For example, consider a sales data ware- developed to efficiently compute similarity joins. However, house. Owing to various errors in the data due to typing these methods are all customized to particular similarity mistakes, differences in conventions, etc., product names functions (e.g., [1, 8, 9]). and customer names in sales records may not match exactly A general purpose data cleaning platform, which has to with master product catalog and reference customer regis- efficiently support similarity joins with respect to a variety tration records respectively. In these situations, it would be of similarity functions is faced with the impractical option desirable to perform similarity joins. For instance, we may of implementing and maintaining efficient techniques for a join two products (respectively, customers) if the similar- number of similarity functions, or the challenging option of ity between their part descriptions (respectively, customer supporting a foundational primitive which can be used as a names and addresses) is high. This problem of joining sim- building block to implement a broad variety of notions of ilar data has been studied in the context of record linkage similarity. (e.g. [6, 7]), of identifying approximate duplicate entities in In this paper, we propose the SSJoin operator as a foun- databases (e.g., [5, 9, 11]). It is also relevant when identi- dational primitive and show that it can be used for sup- fying for a given record the best few approximate matches porting similarity joins based on several string similarity from among a reference set of records [4]. The similarity functions—e.g., edit similarity, jaccard similarity, general- join is the fundamental operation upon which many of these ized edit similarity, hamming distance, soundex, etc.—as techniques are built. well as similarity based on cooccurrences [1]. In defining Current approaches exploit similarity between attribute the SSJoin operator, we exploit the observation that set Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

2. overlap can be used effectively to support a variety of simi- 2. Similarity Based on Set Overlap larity functions [13]. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed In this section, we formally define the SSJoin operator for) each one of them. As we will show later, the design by considering a simple notion of string similarity by map- and implementation of this logical operator leverages the ping the strings to sets and measuring their similarity using existing set of relational operators, and helps define a rich set overlap. We then define the SSJoin operator that can be space of alternatives for optimizing queries involving simi- used to evaluate this notion of set overlap similarity. larity joins. There are several well-known methods of mapping a string to a set, such as the set of words partitioned by delim- The SSJoin—denoting set similarity join—operator ap- iters, the set of all substrings of length q, i.e., its constituent plies on two relations R and S both containing columns q-grams, etc. For example, the string “Microsoft Corpo- A and B. A group of R.B values in tuples sharing the ration” could be treated as a set of words {’Microsoft’, same R.A value constitutes the set corresponding to the ’Corp’}, or as a set of 3-grams, {‘Mic’, ‘icr’, ‘cro’, ‘ros’, R.A value. The SSJoin operator returns pairs of dis- ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’}. Hence- tinct values R.A, S.A if the overlap of the corresponding forth, we refer to the set corresponding to a string σ as groups of R[B] and S[B] values is above a user specified S et(σ). This set could be obtained by any of the above threshold. We allow both weighted and unweighted ver- methods. In this paper, we focus on multi-sets. Whenever sions. As an example, consider two relations R[state, city] we refer to sets, we mean multi-sets. Hence, when we refer and S[state, city]. Setting A = state and B = city, the to the union and intersection of sets, we mean the multi-set SSJoin operator returns pairs of R.state, S.state values union and multi-set intersection respectively. if the overlap between sets of cities which occur with each In general, elements may be associated with weights. state is more than a threshold. So, it may return the pairs This is intended to capture the intuition that different por- (‘washington’, ‘wa’) and (‘wisconsin’, ‘wi’) because the tions of a string have different importance. For example, sets of cities within these groups overlap significantly. We in the string “Microsoft Corp”, we may want to associate will show in Section 3 that similarity joins based on a variety more importance to the portion “Microsoft”. There are well- of similarity functions can be cast into a setting leveraging known methods of associating weights to the set elements, the SSJoin operator. such as the notion of Inverse Document Frequency (IDF) commonly used in Information Retrieval. We assume that We then develop efficient implementations for the the weight associated with an element of a set, such as a SSJoin operator. We first show that the SSJoin operator word or q-gram, is fixed and that it is positive. Formally, can be implemented in SQL using equi-joins. We further all sets are assumed to be drawn from a universe U. Each optimize the implementation for scenarios where the over- distinct value in U is associated with a unique weight. The lap has to be high based on the intuition that a high overlap weight of a set s is defined to be the sum of the weights of between two sets implies that smaller subsets of the two sets its members and is denoted as w t(s). Henceforth, in this also overlap. For example, if the overlap between two sets paper, we talk about weighted sets, noting that in the spe- with 5 elements each has to be greater than 4, then size-2 cial case when all weights are equal to 1, we reduce to the subsets have a non-zero overlap. Based on this observation, unweighted case. we significantly reduce the number of candidate R.A, S.A groups to be compared. We observe that this implementa- Given two sets s1 , s2 , we define their overlap similar- tion can also be carried out using traditional relational oper- ity, denoted Overlap(s1 , s2 ), to be the weight of their in- ators plus the groupwise processing operator [3], making it tersection, i.e., w t(s1 ∩ s2 ). The overlap similarity be- much easier to integrate with a relational engine. Not sur- tween two strings, σ1 , σ2 , Overlap(σ1 , σ2 ) is defined as prisingly, our proposed techniques are significantly better Overlap(S et(σ1 ), S et(σ2 )). than using UDFs to compute similarities, which usually re- Given relations R and S, each with string valued attribute sults in plans based on cross products. A, consider the similarity join between R and S that returns all pairs of tuples where the overlap similarity between R.A The rest of the paper is organized as follows. In Sec- and S.A is above a certain threshold. We expect that when tion 2, we define the SSJoin operator. In Section 3, we two strings are almost equal, their overlap similarity is high, instantiate similarity joins based on a variety of similarity and hence this is a natural similarity join predicate to ex- functions. In Section 4, we describe an efficient physical press. We next introduce the SSJoin operator that can be implementation for the SSJoin operator. In Section 5, we used to express this predicate. show using several real datasets that our physical implemen- In this paper, we assume the standard relational data tations are efficient, and sometimes substantially better than model for simplicity. But, our techniques are also applica- custom implementations. We discuss related work in Sec- ble to other models which allow inline representation of set- tion 6, and conclude in Section 7. valued attributes. We assume that all relations are in the First Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

3. Normal Form, and do not contain set-valued attributes. Sets to denote the following result: { ar , as ∈ R.A × and hence the association between a string and its set are S.A|pred(ar , as ) is true }. also represented in a normalized manner. For example, the We also write pred as {OverlapB (ar , as ) ≥ ei }. set of rows in relation R of Figure 1 represents the associ- ation between the string “Microsoft Corp” and its 3-grams; We illustrate this through the following examples based the third norm column denotes the length of the string. on Figure 1. The third column Norm denotes the length of the string. In general, the norm denotes either the length OrgName 3-gram Norm OrgName 3-gram Norm of the string, or the cardinality of the set, or the sum of the Microsoft Corp mic 12 Mcrosoft Corp mcr 11 weights of all elements in the set. Several similarity func- Microsoft Corp icr 12 Mcrosoft Corp cro 11 Microsoft Corp cro 12 Mcrosoft Corp ros 11 tions use the norm to normalize the similarity. … … … … … … Example 2 : As shown in Figure 1, let R(OrgName, 3-gram, N orm) Microsoft Corp cor 12 Mcrosoft Corp cor 11 Microsoft Corp orp 12 Mcrosoft Corp orp 11 relations and R S S(OrgName, 3, N orm) associate the organization names with (1) all 3-grams in each organization name, and Figure 1. Example sets from strings (2) the number of 3-grams for each name. The predicate in the SSJoin operator may be instantiated in one of the We describe the SSJoin operator. Consider relations following ways to derive different notions of similarity. R(A, B) and S(A, B) where A and B are subsets of columns. Each distinct value ar ∈ R.A defines a group, • Absolute overlap: OverlapB (ar , as ) ≥ 10 joins the which is the subset of tuples in R where R.A = ar . pair of strings “Microsoft Corp”, “Mcrosoft Corp” Call this set of tuples S et(ar ). Similarly, each distinct since the overlap between the corresponding sets of 3- value as ∈ S.A defines a set S et(as ). The simplest grams is 10. form of the SSJoin operator joins a pair of distinct val- ues ar , as , ar ∈ R.A and as ∈ S.A, if the pro- • 1-sided normalized overlap: jections on column B of the sets S et(ar ) and S et(as ) OverlapB ( a, norm r , a, norm s ) ≥ 0.8 · R.norm have a high overlap similarity. The formal predicate is joins the pair of strings “Microsoft Corp”, “Mcrosoft Overlap(πB (S et(ar ), πB (S et(as ))) ≥ α for some thresh- Corp” since the overlap between the corresponding old α. We denote Overlap(πB (S et(ar ), πB (S et(as ))) sets of 3-grams is 10, which is more than 80% of 12. as OverlapB (ar , as ). Hence, the formal predicate is • 2-sided normalized overlap: OverlapB (ar , as ) ≥ α. We illustrate this through an ex- OverlapB ( a, norm r , a, norm s ) ≥ {0.8 · ample. R.norm, 0.8 · S.norm} also returns the pair of strings Example 1 : Let relation R(OrgName, 3-gram) and “Microsoft Corp”, “Mcrosoft Corp” since 10 is S(OrgName, 3-gram) shown in Figure 1 associate the more than 80% of 12 and 80% of 11. strings “Microsoft Corp” and “Mcrosoft Corp” with In the next section, we show how the intuitive notion of their 3-grams. Denoting OrgName by A and 3- set overlap can be used to capture various string similarity gram by B, the SSJoin operator with the predicate functions. We discuss the implementation of the SSJoin op- OverlapB (ar , as ) ≥ 10 returns the pair of strings erator in Section 4. “Microsoft Corp”, “Mcrosoft Corp” since the overlap be- tween the corresponding sets of 3-grams is 10. 3. Using SSJoin Operator for Similarity Joins In general, we may wish to express conditions such as: the overlap similarity between the two sets must be 80% of In this section, we show illustrate the usage of the the set size. Thus, in the above example, we may wish to SSJoin operator to implement similarity joins based on a assert that the overlap similarity must be higher than 80% variety of previously proposed string similarity functions. of the number of 3-grams in the string “Microsoft Corp”. Earlier techniques relied on distinct specialized implemen- We may also wish to be able to assert that the overlap sim- tations for each similarity function. In contrast, our ap- ilarity be higher than say 80% of the sizes of both sets. We proach relies on the SSJoin operator to perform bulk of the now formally define the SSJoin operator as follows, which effort. Only a few checks have to be performed on the result addresses these requirements. of the SSJoin operator. Both the coding effort for program- ming these checks and the additional number of such checks Definition 1: Consider relations R(A, B) and S(A, B). is very small. Let pred be the predicate i {OverlapB (ar , as ) ≥ ei }, In this section, without loss of generality and for clar- where each ei is an expression involving only constants and ity in description, we fix unary relations Rbase(A) and columns from either R.A or S.A. We write R SSJoinpredA S Sbase(A) where A is a string-valued attribute. The goal Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

4. UDF check for string similarity ED (R.A, S.A) ” Į SSjoin R SSjoinAS pred = (OverlapB(ar,as) • (|R.A| - 1 – (Į -1)q) R (A, B, norm(A)) S (A, B, norm(A)) String to set String to set R[A,B,norm(A)] S[A,B,norm(A)] Rbase (A: String) Sbase (A: String) Construct q-gram sets Construct q-gram sets Figure 2. String Similarity Join using SSJoin Figure 3. Edit distance join is to find pairs Rbase.A, Sbase.A where the textual sim- tance is less than an input threshold α. This implementation ilarity is above a threshold α. Our approach (outlined can be easily extended to edit similarity joins. in Figure 2) is to first convert the strings Rbase(A) and We illustrate the connection between edit distance and Sbase(A) to sets, construct normalized representations overlap through the following example. R(A, B, norm(A)) and S(A, B, norm(A)), and then suit- Definition 3: Consider the strings “Microsoft Corp” and ably invoke the SSJoin operator on the normalized repre- “Mcrosoft Corp”. The edit distance between the two is 1 sentations. The invocation is chosen so that all string pairs (deleting ’i’). The overlap similarity between their 3-grams whose similarity is greater than α are guaranteed to be in the is 10, more than 80% of the number of 3-grams in either result of the SSJoin operator. Hence, the SSJoin operator string. provides a way to efficiently produce a small superset of the correct answer. We then compare the pairs of strings us- The intuition is all q-grams that are “far away” from the ing the actual similarity function, declared as a UDF within place where the edits take place must be identical. Hence, if a database system, to ensure that we only return pairs of the edit distance is small, then the overlap on q-grams must strings whose similarity is above α. be high. The authors of [9] formalize this intuitive relation- Note that a direct implementation of the UDF within a ship between edit distance and the set of q-grams: database system is most likely to lead to a cross-product Property 4: [9] Consider strings σ1 and σ2 , of lengths where the UDF is evaluated for all pairs of tuples. On the |σ1 | and |σ2 |, respectively. Let QGSetq (σ) denote other hand, an implementation using SSJoin exploits the the set of all contiguous q-grams of the string σ. If support within database systems for equi-joins to result in a σ1 and σ2 are within an edit distance of , then significant reduction in the total number of string compar- Overlap(QGSetq (σ1 ), QGSetq (σ2 )) ≥ max(|σ1 |, |σ2 |)− isons. This results in orders of magnitude improvement in q+1− ·q performance, as we will discuss in Sections 4 and 5. Thus, in the above example, the edit distance is 1, and Prop- 3.1. Edit Distance erty 4 asserts that at least 9 3-grams have to be common. From the above property, we can implement the edit sim- The edit distance between strings is the least number of ilarity join through the operator tree shown in Figure 3. edit operations (insertion and deletion of characters, and We first construct the relations R(A, B, norm(A)) and substitution of a character with another) required to trans- S(A, B, norm(A)) containing the norms and q-gram sets form one string to the other. For example, the edit distance for each string. We then invoke the SSJoin operator over between strings ‘microsoft’ and ‘mcrosoft’ is 1, the number these relations in order to identify R.A, S.A pairs which of edits (deleting ‘i’) required to match the second string are very similar. Note that we further require a filter based with the first. The edit distance may be normalized to be on edit similarity (possibly as a user-defined function) in or- between 0 and 1 by the maximum of the two string lengths. der to filter out pairs whose overlap similarity is higher than Hence, the notion of edit similarity can also be defined as that given by Property 4 but edit similarity is still less than follows. the required threshold. Definition 2: Given two strings σ1 and σ2 , the edit distance 3.2. Jaccard Containment and Resemblance ED(σ1 , σ2 ) between them is the minimum number of edit We define the Jaccard containment and resemblance be- operations—insertion, deletion, and substitution—to trans- tween strings through the Jaccard containment and resem- form σ1 into σ2 . We define the edit similarity ES(σ1 , σ2 ) to ED(σ1 ,σ2 ) blance of their corresponding sets. We then illustrate the be 1.0 − max(|σ 1 |,|σ2 |) . use of the SSJoin operator for Jaccard containment using the following example. We consider the form of edit distance join addressed in [9], which returns all pairs of records where the edit dis- Definition 5: Let s1 and s2 be weighted sets. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

5. 1. The Jaccard containment of s1 in s2 , JC(s1 , s2 ) is de- sequence into another include insertion, deletion and re- fined to be w t(s 1 ∩s2 ) w t(s1 ) . placement of one token with another. Each edit opera- tion is associated with a cost dependent on the tokens (and 2. The Jaccard resemblance between s1 and s2 , their weights) involved in the edit. To illustrate, consider JR(s1 , s2 ), is defined to be w t(s1 ∩s2 ) w t(s1 ∪s2 ) . the above example strings. The strings “microsoft corp” and “microsft corporation” are close because ‘microsoft’ Example 3: Suppose we define the Jaccard containment and ‘microsft’ are close according to edit distance and the between two strings by using the underlying sets of 3- weights of ‘corp’ and ‘corporation’ are relatively small ow- grams. Consider strings σ1 = “Microsoft Corp” and σ2 = ing to their high frequency. GES has been shown to be very “Mcrosoft Corp”. We show how a Jaccard containment effective for matching erroneous tuples with their correct predicate on these strings translates to a SSJoin predicate. counterparts [4]. Let ed(σ1 , σ2 ) denote the absolute edit Suppose we want to join the two strings when the Jaccard distance normalized by the maximum of the strings lengths, containment of σ1 in σ2 is more than 0.8. ED(σ1 ,σ2 ) i.e., ed(σ1 , σ2 ) = max(|σ1 |,|σ2 |) . As shown in Figure 1, let R(OrgName, 3−gram, norm) and S(OrgName, 3−gram, norm) associate the strings Definition 6: Let σ1 and σ2 be two strings. The cost of “Microsoft Corp” and ”Mcrosoft Corp” with (1) the ac- transforming a token t1 in the set S et(σ1 ) of tokens corre- tual 3-grams in column 3 − gram, and (2) the number of 3- sponding to σ1 to a token t2 in S et(σ2 ) is ed(t1 , t2 )·w t(t1 ). grams in column norm. The cost of inserting or deleting a token t equals w t(t). The We can see that the Jaccard containment predi- cost tc(σ1 , σ2 ) of transforming σ1 to σ2 is the minimum cost cate is equivalent to the following SSJoin predicate: transformation sequence for transforming σ1 into σ2 . The OverlapB ( a, norm r , a, norm s ) ≥ 0.8 · R.norm. generalized edit similarity GES(σ1 , σ2 ) is defined as fol- lows. In general, we construct relations R A, B, norm(A) and S A, B, norm(A) from Rbase and Sbase respectively, tc(σ1 , σ2 ) GES(σ1 , σ2 ) = 1.0 − min( , 1.0) that associates a string with (1) the weight of the underly- w t(S et(σ1 )) ing set, and (2) the set of elements in its underlying set. The Jaccard containment condition can then be expressed We now illustrate the connection between GES and the using the operator tree shown in Figure 4. Note that because SSJoin predicate. Jaccard containment like the SSJoin operator measures the Example 4: Consider strings σ1 = “Microsoft Corp” degree of overlap, this translation does not require a post- and σ2 = “Mcrosoft Corp”. Consider the sets Set(σ1 ) = processing step. {Microsoft, Corp} and Set(σ2 ) = {Mcrosoft, Corp} ob- Observe that for any two sets s1 and s2 , tained using the tokenizing function and ignoring the se- JC(s1 , s2 ) ≥ JR(s1 , s2 ). Hence, JR(s1 , s2 ) ≥ α ⇒ quentiality among tokens. Suppose, we expand Set(σ1 ) Max(JC(s1 , s2 ), JC(s2 , s1 )) ≥ α. Therefore, as shown on to ExpandedSet(σ1 ) = {Microsoft, Mcrosoft, Macrosoft, the right hand side in Figure 4, we use the operator tree for Corp} by including tokens (say, from a dictionary) whose Jaccard containment and add the check for Jaccard resem- edit similarity with any token in Set(σ1 ) is high. Then, the blance as a post-processing filter. In fact, we check for the overlap between ExpandedSet(σ1 ) and Set(σ2 ) is high. Jaccard containment of JC(R.A, S.A) and JC(S.A, R.A) being greater than α. The above example illustrates the basic intuition. Infor- mally, the expansion adds to a set corresponding to R.A 3.3. Generalized Edit Similarity all tokens from a dictionary (say, all tokens in any attribute value of S.A) whose edit similarity with any token in the set This similarity function, introduced in [4] is a weighted is greater than a threshold β (< α). If the generalized edit variant of edit distance. The idea is to address some limita- similarity between the strings σ1 and σ2 is higher than α tions of plain edit distance, illustrated through the following then the overlap between their expanded sets must be higher example. Consider strings “microsoft corp”, “microsft cor- than α − β. The intuition is that the cost of transforming poration” and “mic corp”. The edit distance between “mi- any token t1 in Set(σ1 ) to a token t2 in Set(σ2 ) is either crosoft corp” and “mic corp” is less than that between “mi- (i) less than β if there is an overlapping token t between the crosoft corp” and “microsft corporation”. So is the case for expanded sets that is close to both t1 and t2 , or (ii) greater Jaccard similarity because it only matches tokens which are than β, otherwise. Therefore, the similarity is bounded by identical. α − β if the overlap is greater than α. In general, we can To deal with these limitations, the generalized edit sim- expand both sets Set(σ1 ) and Set(σ2 ) by including similar ilarity (GES) function was proposed in [4]. Each string tokens. The details are intricate and require a generalization is interpreted as a sequence of tokens, through some tok- of our element weight model to allow an element having dif- enizing function. The edit operations that transform one ferent weights as opposed to a fixed weight. In the interest Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

6. JR(Set(R.A), Set(S.A)) • Į R SSjoinA S pred = OverlapB(<ar, norm(ar)>, <as, norm(as)> R SSjoinA S • Į Ânorm(ar) pred = OverlapB(<ar, norm(ar)>, <as, norm(as)> • {Į Ânorm(ar), Į Ânorm(as))} R[A, B, norm(A)] S[A, B, norm(A)] R[A, B, norm(A)] S[A, B, norm(A)] Construct sets for R.A Construct sets for S.A Construct sets for R.A Construct sets for S.A Figure 4. Jaccard containment and resemblance joins of space, we omit the details and note that all techniques described in this paper can be generalized appropriately. Author1 SSjoinName Author2 Pred: OverlapPTitle(namer, names) • Į norm(namer)) 3.4. Beyond Textual Similarity Author1 Author2 We now illustrate the applicability of the SSJoin oper- [Name, norm(name) PTitle] [Name, norm(name), PTitle] ator to similarity joins based upon non-textual notions of similarity. The first notion of similarity is based on that of Figure 5. Co-occurrence join using SSJoin “co-occurrence” between columns and the second is based on soft functional dependencies. We illustrate these obser- vations using examples. Both the following examples are FDs may not hold exactly for a variety of reasons: they may based on an example publication database involving tables not have been enforced due to efficiency reasons, or the rela- storing papers and authors. tion may be the union of relations from several independent sources. For example, a large percentage of emails (if they are valid) uniquely determine the author tuple. In gen- Using Co-occurrence eral, if we wish to use the functional dependency X → A Example 5: Suppose we have two tables, say from different to identify two similar values of R.A, then we can simply sources that are being integrated, of author names joined proceed by performing an equi-join on R.X. with the titles of the papers, say with the schema ptitle, The question arises how we can exploit multiple FDs. aname . Since we want a unified view of all authors, we Informally, two tuples agreeing on the source attributes of are interested in identifying author names that are likely to several FDs indicate that the target attribute values are the represent the same author. Now, if the naming conventions same. One natural way to aggregate the information from in the two sources are entirely different, it is quite likely that multiple functional dependencies is to use majority vote. We the textual similarity between the author names is only a formalize this as follows. Let {X1 , . . . , Xh , A} be a set of partial indicator of their similarity. We are forced to rely on columns in R and S. Each Xi is expected to functionally alternative sources of information for identifying duplicate determine A. author entities. Definition 7: For two tuples t1 and t2 in R, we write In this instance, we can use the set of paper titles as- k/h sociated with each author to identify authors. The idea is t1 ≈FD t2 if t1 and t2 agree on at least k out of the h that if two authors are the same, then the set of paper titles Xi . co-occurring with them must have a large overlap. We can Example 6 : Consider two relations Authors1, express this using Jaccard containment, for instance, which Authors2, both with the schema {name, address, translates directly into the SSJoin operator, as shown in city, state, zip, email, phone}. We may the operator tree in Figure 5. This notion of similarity has want to join two author names if at least two of the follow- been shown to be very effective for identifying approximate ing agree: address, email, phone. That would be duplicates [1]. 2/3 expressed as Author1 ≈FD Author2. Our next example illustrates how functional dependen- cies can be exploited for approximate equality. We illustrate how the SSJoin operator can be used to k/h compute the ≈FD predicate using the above example. By Using Soft Functional Dependencies associating each author name with a set of ordered pairs Column, Value and normalizing the resulting relation, Another source of identifying duplicate information is soft we get a relation with the schema Name, AEP (AEP for functional dependencies (FDs), which may not hold on the address-email-phone). We can implement the above predi- entire relation but over a large subset of the relation. The cate through the SSJoin operator as shown in Figure 6. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

7. Having  (gAR, gAS) > Group By (R.A,S.A) Author1 SSjoinName Author2 Pred: OverlapAEP(Author1.Name, Author2.Name) > T.S.A = S.A S[A,B,…] T.R.A = R.A Author1 Author2 T <Name, AEP> <Name, AEP> R[A,B,…] R.B=S.B prefix-filter(R,A,B,1- ) prefix-filter(S,A,B,1- ]) Figure 6. FD-based join using SSJoin R[A,B,…] S[A,B,…] Having  (gAR, gAS) > Group By (R.A,S.A) Figure 8. Prefix-filter implementation of SSJoin R.B=S.B R[A,B,…] S[A,B,…] represent tokens contained in strings. Certain tokens like “the” and “inc” can be extremely frequent in both R and Figure 7. Basic implementation of SSJoin S relations. In such scenarios, which occur often, the size of the equi-join on B is very large (refer Section 5). The 4. Implementation of SSJoin challenge, therefore, is to reduce the intermediate number of <R.A, S.A> groups compared. Next, we describe our approach to address this problem. In this section, we discuss the implementation of the SSJoin operator. We consider various strategies, each of which can be implemented using relational operators. The 4.2. Filtered SSJoin Implementation idea is to exploit the property that SSJoin has to only return pairs of groups whose similarity is above a cer- The intuition we exploit is that when two sets have a large tain threshold, and that thresholds are usually high. In overlap, even smaller subsets of the base sets overlap. To this section, we talk mostly about executing the operation make the intuition concrete, consider the case when all sets R SSJoinpred A S over relations R(A, B) and S(A, B) where are unweighted and have a fixed size h. We can observe the the predicate is OverlapB (ar , as ) ≥ α for some positive following property. constant α. The implementation extends to the case when Property 8: Let s1 and s2 be two sets of size h. Consider OverlapB (ar , as ) is required to be greater than a set of ex- any subset r1 of s1 of size h − k + 1. If |s1 ∩ s2 | ≥ k, then pressions. r1 ∩ s2 = φ. 4.1. Basic SSJoin Implementation For instance, consider the sets s1 ={1,2,3,4,5} and s2 ={1,2,3,4,6} which have an overlap of 4. Any subset of s1 Since α > 0, we can conclude that for a pair <ar , as > of size 2 has a non-zero overlap with the set s2 . Therefore, to be returned, at least one of the values in the column B instead of performing an equi-join on R and S, we may ig- related to ar and as must be the same. Indeed, by com- nore a large subset of S and perform the equi-join on R and puting an equi-join on the B column(s) between R and S a small filtered subset of S. By filtering out a large subset of and adding the weights of all joining values of B, we can S, we can reduce, often by very significant margins, the size compute the overlap between groups on R.A and S.A. Fig- of the resultant equi-join. We note here that this approach is ure 7 presents the operator tree for implementing the ba- similar to the OptMerge optimization in [13]. sic overlap-SSJoin. We first compute the equi-join be- The natural question now is whether or not we can apply tween R and S on the join condition R.B = S.B. Any such a prefix-filter to both relations R and S in the equi-join. < R.A, S.A > pair whose overlap is non-zero would be Interestingly, we find that the answer is in the affirmative. present in the result. Grouping the result on < R.A, S.A > We illustrate this as follows. Fix an ordering O of the uni- and ensuring, through the having clause, that the overlap is verse U from which all set elements are drawn. Define the greater than the specified threshold α would yield the result k-prefix of any set s to be the subset consisting of the first k of the SSJoin. elements as per the ordering O. Now, if |s1 ∩ s2 | ≥ k, then The size of the equi-join on B varies widely with the their (h − k + 1)-prefixes must intersect. For example, con- joint-frequency distribution of B. Consider the case when sider s1 = {1, 2, 3, 4, 5} and s2 = {1, 2, 3, 4, 6} as before. the SSJoin operator is used to implement the Jaccard sim- Assume the usual ordering of natural numbers. Since the ilarity between strings. Here, the values in the attribute B overlap between s1 and s2 is 4, their size (5 − 4 + 1) = 2- Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

8. Overlap(R.Set(A), S.Set(A)) > pression of the form α · R.N orm, then we extract a βar ,norm(ar ) = (w t(S et(ar )) − α · norm(ar )) pre- R.B=S.B fix of the set S et(ar ). This generalizes to the case prefix-filter(R,[A,norm(A)],B,1- ) prefix-filter(S,[A, norm(A)],B,1- ]) when we have an expression involving constants and R.N orm. R[A, B, norm(A)] S[A, B, norm(A)] • For a 2-sided normalized overlap predicate OverlapB (ar , as ) ≥ α · M ax(R.N orm, S.N orm), Figure 9. Prefix-filter with inline set represen- we apply different prefix-filter to relations R and S. tation We apply the filter prefix-filter(R, α · R.N orm) to R and prefix-filter(S, α · S .N orm) to S. • For the evaluation of a 1-sided normalized overlap prefixes must intersect, which is the case — the size-2 pre- predicate OverlapB (ar , as ) ≥ α · R.N orm, we can fixes of both s1 and s2 is {1, 2}. Therefore, an equi-join on apply the prefix-filter only on sets in R. B on the filtered relations will return all pairs that satisfy the SSJoin predicate. The result would be a superset of all pairs of < R.A, S.A > groups with overlap greater than the 4.3. Implementation Issues given threshold. And, the number of candidate groups of We now discuss the implementation issues around the pairs is significantly (sometimes, by orders of magnitude) prefix-filter approach. smaller than the number of pairs from the equi-join on the full base relations (refer Section 5). 4.3.1 Mapping Multi-set Intersection to Joins This intuition can be extended to weighted sets. Consider any fixed ordering O of the domain from which R.B and Observe that the form of predicate we consider here involves S.B are drawn. Given a weighted set r drawn from this multi-set intersection when any R.A (or S.A) group con- domain, define prefix β (r) to be the subset corresponding to tains multiple values on the R.B attributes. In order to the shortest prefix (in sorted order), the weights of whose be able to implement them using standard relational oper- elements add up to more than β. We have the following ators, we convert these multi-sets into sets; we convert each result: value in R.B and S.B into an ordered pair containing an ordinal number to distinguish it from its duplicates. Thus, Lemma 1: Consider two weighted sets s1 and s2 , such that for example, the multi-set {1, 1, 2} would be converted to w t(s1 ∩s2 ) ≥ α. Let β1 = w t(s1 )−α and β2 = w t(s2 )−α. { 1, 1 , 1, 2 , 2, 1 }. Since set intersections can be imple- Then prefix β1 (s1 ) ∩ prefix β2 (s2 ) = φ. mented using joins, the conversion enables us to perform Suppose that for the set defined by value ar ∈ R.A, multi-set intersections using joins. S et(ar ) (respectively for as ∈ S.A), we extract a βar = (w t(S et(ar )) − α) prefix under O (respectively, a βas pre- 4.3.2 Determining the Ordering fix). From the above lemma, performing the equi-join B on Note that the prefix-filter is applicable no matter what or- the resulting relations will result in a superset of the result dering O we pick. The question arises whether the ordering of the SSJoin. We can then check the SSJoin predicate picked can have performance implications. Clearly, the an- on the pairs returned. Since the filter is letting only a prefix swer is that it does. Our goal is to pick an ordering that under a fixed order to pass through, we call this filter the minimizes the number of comparisons that the ordering will prefix-filter. We refer to the relation obtained by filtering R imply. One natural candidate here is to order the elements as prefix-filter(R, α). by increasing order of their frequency in the database. This The filtered overlap implementation of the SSJoin oper- way, we try to eliminate higher frequency elements from the ator is illustrated in Figure 8. We first join the prefix-filtered prefix filtering and thereby expect to minimize the number relations to obtain candidate pairs R.A, S.A groups to be of comparisons. Since many common notions of weights compared. We join the candidate set of pairs with the base (e.g., IDF) are inversely proportional to frequency, we can relations R and S in order to obtain the groups so that we can implement this using the element weights. Several opti- compute the overlap between the groups. The actual com- mization issues arise such as to what extent will prefix- putation of the overlap is done by grouping on R.A, S.A filtering help, whether it is worth the cost of producing the and filtering out groups whose overlap is less than α. filtered relations, whether we should proceed by partition- We need to extend this implementation to address the fol- ing the relations and using different approaches for different lowing issues. partitions, etc. • Normalized Overlap Predicates: Instead of a con- In our implementation, we order R.B values with respect stant α as in the discussion above, if we have an ex- to their IDF weights. Since high frequency elements have Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

9. lower weights, we filter them out first. Therefore, the size of EditSim(R.A, S.A) > Customized Edit Implementation 4000 EditSim-Filter the subset (and hence the subsequent join result) let through 3500 Candidate-enumeration Edit similarity specific 3000 would be the smallest under this ordering. Time units Prep filters 2500 2000 1500 R.B=S.B 1000 500 4.3.3 Implementing the prefix-filter 0 0.8 0.85 0.9 0.95 R[A,B,…] S[A,B,…] Threshold The prefix-filter can be implemented using a combination of standard relational operators such as group by, order by, and Figure 11. Customized edit similarity join join, and the notion of groupwise processing [2, 3] where we iteratively process groups of tuples (defined as in group-by, Threshold SSJoin Direct i.e., where every distinct value in a grouping column consti- 0.80 546492 28252476 0.85 129925 21405651 tutes a group) and apply a subquery on each group. In our 0.90 16191 13913492 case, we would group the tuples of R on R.A and the sub- 0.95 7772 5961246 query would compute the prefix of each group it processes. In our implementation, we use a server-side cursor which Table 1. #Edit comparisons requires the scan of the base relation R ordered on A, B. While scanning, we mark the prefix of each group Set(ar ). Observe that ordering R.B with respect to the fixed order O 5. Experiments of R.B may require an additional join of R with the “order” We now experimentally demonstrate the generality of table. the SSJoin operator in allowing several common similarity functions besides the efficiency of our physical implemen- 4.3.4 Inlined Representation of Groups tations. Datasets: All our experiments are performed using the Cus- A property of the prefix-filter approach is that when we ex- tomer relation from an operational data warehouse. We eval- tract the prefix-filtered relations, we lose the original groups. uate the SSJoin operator by implementing similarity joins Since the original groups are required for verifying the on a relation R of 25, 000 customer addresses with itself. SSJoin predicate, we have to perform a join with the base While performing Jaccard and generalized edit similarity relations again in order to retrieve the groups, as shown in joins between two relations R and S based on the attribute Figure 8. These joins can clearly add substantially to the values in A, we assign IDF weights to elements of sets (to- cost of the SSJoin operation. kens) as follows: log( |R|+|S| ft ), where ft is the total number Next, we discuss a new implementation which can avoid of R[A] and S[A] values, which contain t as a token. these joins. The idea is to “carry” the groups along with We implemented the SSJoin operator and the similarity each R.A and S.A value that pass through the prefix-filter. join operations as client applications over Microsoft SQL This way, we can avoid the joins with the base relations. Server 2005. We ran all our experiments on a desktop PC The intuition is illustrated in Figure 9. In order to do so, we running Windows XP SP2 over a 2.2 GHz Pentium 4 CPU either require the capability to define a set-valued attribute with 1GB main memory. or a method to encode sets as strings or clobs, say by con- catenating all elements together separating them by a special marker. 5.1. Evaluating the SSJoin Encapsulation In our implementation, we choose the latter option. Now, measuring the overlap between R.A, S.A groups can be We compare our implementations with the best known done without a join with the base relations. However, we re- customized similarity join algorithm for edit similarity [9]. quire a function, say a UDF, for measuring overlap between No specialized similarity join algorithms have been pro- inlined sets. This implementation goes beyond the capabil- posed for Jaccard resemblance and generalized edit simi- ities of standard SQL operators as it requires us to compute larity. Therefore, we consider the basic SSJoin implemen- set overlaps. However, the UDF we use is a simple unary tation as the strawman strategies for these similarity joins. operator that does not perform very sophisticated operations Previous implementations for the generalized edit similarity internally, especially when the sets are bounded. Our exper- were probabilistic [4]; hence, we do not compare our imple- iments show that this alternative is usually more efficient mentations with the earlier strategy. than the prefix-filtered implementation since it avoids the The customized algorithm for edit similarity join is sum- redundant joins. marized by the operator tree on the left hand side of Fig- We note here that our current implementation is not ure 11: an equi-join on R.B and S.B along with additional geared for the case of unbounded sets. Dealing with large filters (difference in lengths of strings has to be less, and the sets is left for future work. positions of at least one q-gram which is common to both Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

10. Basic SSJoin Prefix-filtered SSJoin In-line representation 1800 2500 1000 1600 Prep SSJoin Filter Filter Prep Prefix-filter SSJoin Filters 1400 2000 SSJoin 800 Time units Time units 1200 Prefix-filter Time units 1500 600 1000 Prep 800 1000 600 400 400 500 200 200 0 0 0 0.8 0.85 0.9 0.95 0.8 0.85 0.9 0.95 Threshold Threshold 0.8 0.85Threhold 0.9 0.95 Figure 10. Edit similarity join: basic, prefix-filtered, inline-represented SSJoin strings has to be close) followed by an invocation of the edit based choices are not possible for the framework proposed similarity computation. by Sarawagi et al. [13]. Figure 11 plots the times required for implementing the The prefix-filtered implementation with the inline set edit similarity join using the customized algorithm. Com- representation is still more efficient than the basic strategy paring the times with those in Figure 10, we first note that even at lower thresholds. As expected, avoiding the over- the edit similarity joins based on our implementations (in- head of joins with base relations to regroup elements signif- cluding the basic implementation) are faster than that using icantly improves efficiency. the custom implementation. The reason is that the custom We also note that the plans chosen by the query optimizer implementation compares a very large number of strings. only involved hash and merge joins. For no instance did Table 1 plots the number of edit similarity computations the optimizer choose an index-based join even if we created through the SSJoin operator and that through the custom clustered indexes on temporary tables. Therefore, we con- implementation. The custom implementation performs a clude that a fixed index-based strategy for similarity joins much (by orders of magnitude) larger number of edit simi- as in [13] and [6] is unlikely to be optimal always. Instead, larity comparisons. The SSJoin operator implementations we must proceed with a cost-based choice that is sensitive rely on the overlap predicate in order to significantly reduce to the data characteristics. the number of edit similarity computations. Thus, using the Jaccard Resemblance: Figure 12 plots the times for imple- more general SSJoin operator, we are able to implement menting the Jaccard resemblance join at various thresholds the edit similarity join more efficiently than the previously through each implementation of the SSJoin operator. The known best customized solution. prefix-filtered implementation is 5-10 times faster than the basic implementation. Therefore, prefix-filtering is very ef- Edit Similarity join: Figure 10 plots the times required for fective in reducing the cost of the SSJoin operator. Also, implementing a similarity join based upon edit similarity at observe that the prefix-filtered implementation with inline various thresholds. The prefix-filtered implementations are representation is around 30% faster than the standard prefix- significantly faster than the basic implementation at higher filtered implementation. As expected, avoiding joins with thresholds (greater than or equal to 0.85). However, at lower the base relations just to gather together all elements of thresholds the basic implementation of the SSJoin operator groups to be compared is beneficial even if it means that is better than the prefix-filtered implementation using stan- additional information has to be carried through the prefix- dard SQL operators. At lower thresholds, the number of filter for each tuple. < R.A, S.A > group pairs to be compared is inherently Observe that most of the time in the basic implementa- high and any technique has to compare higher number of tion is spent in the execution of the SSJoin. The preparation groups. So, even prefix-filters have to let a larger number (denoted Prep in the figures) and the filtering (denoted Filter of tuples to pass through. Therefore, the effectiveness of the in the figures) take negligible fractions of the time. The time prefix filters decreases as we decrease the thresholds. There- taken for the prefix-filtered implementation increases as we fore, prefix filtering is not as effective at lower thresholds. decrease the threshold. Such behavior is expected because The additional cost of prefix filtering would not offset the the number of < R.A, S.A > pairs pruned away decreases savings resulting from reduced join costs. with the threshold. For the prefix-filtered implementations, That there is not always a clear winner between the basic a significant amount of time is spent in the prefix-filter due and prefix-filtered implementations motivates the require- to which the subsequent steps are very efficient. These ob- ment for a cost-based decision for choosing the appropriate servations are true for our implementations of edit similarity implementation. Because of our operator-centric approach, and of generalized edit similarity joins. our SSJoin operator can be enhanced with such rules and Generalized Edit Similarity: Figure 13 plots the times integrated with a query optimizer. Observe that such cost- required for implementing a similarity join based upon Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

11. Basic SSJoin Prefix-Filtered SSJoin In-line representation 70 25 500 Prep SSJoin Filter 60 Prep Prefix-filter SSJoin Filter Prep Prefix-filter SSJoin Filters 400 20 Time units Time units 50 Time units 300 40 15 200 30 10 100 20 5 10 0 0 0 0.8 0.85 0.9 0.95 0.8 0.85 0.9 0.95 0.4 0.6 0.8 0.85 0.9 0.95 Threshold Threshold Threshold Figure 12. Jaccard similarity join: basic, prefix-filtered, and inline represented SSJoin GES Input Size SSJoin Input Output Time units 120 100 100K 288627 rows 2731 224 200K 778172 rows 2870 517 Time units 80 Basic 60 40 Prefix-filtered In-line 250K 1020197 rows 4807 649 20 330K 1305805 rows 3870 1072 0 0.8 0.85 0.9 0.95 Threshold Table 2. Varying input data sizes Figure 13. GES join time required by the prefix-filtered implementation for in- creasing data sizes. We also report the sizes of the tables in- the generalized edit similarity. The conclusions here are put to the SSJoin operator and the size of the output. As the similar to those for the Jaccard similarity. The prefix- input data size increases, the size of the prepared relations filtered implementations are better than those for the which are input to the SSJoin operator increases linearly. basic implementation by almost 2 times. Again, the inline The output size is a characteristic of the data and it can vary representation is better by about 25% than the prefix-filtered widely. The time required depends crucially on the output implementation using standard SQL operators. size besides the input relation size. For instance, adding a large number of very similar pairs would increase the output Beyond textual similarity: As discussed in Section 3, size as well as the time significantly. similarity notions based on agreements with respect to Summary: Based upon the above experiments implement- functional dependencies (say, at least k out of h FDs agree) ing jaccard similarity, edit similarity and generalized edit and that of co-occurrence between attribute values with similarity joins using the SSJoin operator, we conclude that respect to a different attribute can both be reduced to Jac- (i) the SSJoin operator is general enough to implement a card resemblance. The basic strategy for implementing the variety of similarity notions, both textual and non-textual, similarity based on agreements with respect to FDs using a (ii) our algorithms to implement of the SSJoin operator are disjunction of selection predicates results in a cross product efficient — for the edit similarity join, our implementation plan being chosen by the optimizer. We have already seen is more efficient than the best known customized implemen- that our physical implementations of the Jaccard resem- tation — and (iii) the choice of physical implementation of blance join using the SSJoin operator can be significantly the SSJoin operator must be cost-based. more efficient than the basic implementations and the cross product plans. Therefore, we do not discuss experiments based on these (non-textual) similarity functions. 6. Related Work Sarawagi et al. [13] recognized that set overlap is an im- Varying data sizes: Our implementations for the portant measure for dealing with a variety of similarity func- SSJoin operator rely primarily on the relational operators tions. The main difference in our approach is our operator- such as equi-join and group by. These operators have very centric focus. Sarawagi et al. [13] require plug-in functions efficient and scalable implementations within database sys- in order to implement each similarity function, whereas our tems. Hence, our implementations for the SSJoin operator approach is to compose the SSJoin operator with other op- are also scalable. erators. Our design choice leads to the possibility of making We perform a Jaccard similarity join of tables containing cost-based decisions in choosing a physical implementation addresses with themselves; we vary the number of rows in of similarity joins. For example, depending on the size of each table by picking a random subsets from the original re- the relations being joined and the availability of indexes, lation. We fix the threshold at 0.85. Table 2 presents the the optimizer may choose either index-based plans or merge Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

12. and hash joins in order to implement the SSJoin operator. [2] D. Chatziantoniou and K. Ross. Querying multiple features This is in contrast to a fixed implementation based on in- in relational databases. In Proceedings of the VLDB Confer- verted indexes, as in [13]. Further, our implementation of ence, 1996. [3] D. Chatziantoniou and K. A. Ross. Groupwise processing of SSJoin operator is based upon relational operators present relational queries. In VLDB, pages 476–485, 1997. in a database system, making it much easier to integrate it [4] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Ro- into a database system. bust and efficient fuzzy match for online data cleaning. In Our notion of overlap similarity between groups is di- Proceedings of the ACM SIGMOD, June 2003. rectly related to the notion of similarity which measures co- [5] W. Cohen. Integration of heterogeneous databases without occurrence with respect to other attributes [1]. We build common domains using queries based on textual similarity. upon this notion of overlap and encapsulate it into a sim- In Proceedings of ACM SIGMOD, pages 201–212, Seattle, WA, June 1998. ilarity join operator. Second, we further observe that the [6] W. W. Cohen. Data integration using similarity joins and similarity join based on thresholded overlap similarity can a word-based information representation language. ACM be made into a primitive operator and applied to a variety of Transactions on information systems, 18(3):288–321, July other textual similarity functions. We also propose efficient 2000. algorithms for implementing this primitive. [7] I. P. Felligi and A. B. Sunter. A theory for record linkage. Custom join algorithms for particular similarity func- Journal of the American Statistical Society, 64:1183–1210, tions have been proposed for edit distance [9] and for co- 1969. [8] L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava. Text sine similarity [8, 6]. Top-K queries over string similarity joins in an rdbms for web data integration. In In Proc. Intl. functions have also received significant attention in the con- world Wide Web Conference, pages 90–101, 2003. text of fuzzy matching where the goal is to match an in- [9] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, coming record against a reference table [6, 4]. However, S. Muthukrishnan, and D. Srivastava. Approximate string there is no work yet on the design of primitive operators for joins in a database (almost) for free. In Proceedings of top-K queries. However, we note that by composing the the 27th international conference on very large databases (VLDB), pages 491–500, Roma, Italy, September 11-14 SSJoin operator with the top-k operator, we can address 2001. the form of top-K queries which ask for the best matches [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging whose similarity is above a certain threshold. the results of approximate match operations. In Proceedings Set containment joins in object-relational systems may of the 30th international conference on very large databases also be used to express extreme forms of overlap queries (VLDB), pages 636–647, 2004. where the degree of overlap has to be 1.0 (e.g., [12]). How- [11] M. Hernandez and S. Stolfo. The merge/purge problem for ever, these techniques are not applicable for partial overlap large databases. In Proceedings of the ACM SIGMOD, pages 127–138, San Jose, CA, May 1995. queries, which is the focus of the SSJoin operator. Our [12] K. Ramasamy, J. M. Patel, J. F. Naughton, and R. Kaushik. techniques are also applicable for object-relational models Set containment joins: The good, the bad and the ugly. In which allow set-valued attributes. The prefix-filtered imple- VLDB, pages 351–362, 2000. mentation with inline representation for sets can be directly [13] S. Sarawagi and A. Kirpal. Efficient set joins on similarity implemented under these models. predicates. In Proceedings of the ACM SIGMOD, pages 743– 754, 2004. 7. Conclusions In this paper, we introduce a primitive operator SSJoin for performing similarity joins. We showed that similarity joins based on a variety of textual and non-textual similarity functions can be efficiently implemented using the SSJoin operator. We then developed very efficient physical implementations for this operator mostly using standard SQL operators. In future, we intend to integrate the SSJoin operator with the query optimizer in order to make cost-conscious choices among the basic, prefix-filtered, and inline prefix-filtered implementations. References [1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminat- ing fuzzy duplicates in data warehouses. In Proceedings of the 28th international conference on very large databases (VLDB), pages 586–597, Hong Kong, August 20-23 2002. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE