Machine Learning for Online Query Relaxation
Ion Muslea
SRI International
333 Ravenswood, Menlo Park, CA 94025
ion.muslea@sri.com

ABSTRACT

In this paper we provide a fast, data-driven solution to the failing query problem: given a query that returns an empty answer, how can one relax the query's constraints so that it returns a non-empty set of tuples? We introduce a novel algorithm, loqr, which is designed to relax queries that are in disjunctive normal form and contain a mixture of discrete and continuous attributes. loqr discovers the implicit relationships that exist among the various domain attributes and then uses this knowledge to relax the constraints in the failing query.

In a first step, loqr uses a small, randomly-chosen subset of the target database to learn a set of decision rules that predict whether an attribute's value satisfies the constraints in the failing query; this query-driven operation is performed online for each failing query. In a second step, loqr uses nearest-neighbor techniques to find the learned rule that is the most similar to the failing query; it then uses the attribute values from this rule to relax the failing query's constraints. Our experiments on six application domains show that loqr is both robust and fast: it successfully relaxes more than 95% of the failing queries, and it takes under a second to process queries of up to 20 attributes (larger queries of up to 93 attributes are processed in several seconds).

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

online query relaxation, failing query, rule learning, nearest neighbor, Web-based information sources

1. INTRODUCTION

The proliferation of online databases has led to an unprecedented wealth of information that is accessible via Web-based interfaces. Unfortunately, exploiting Web-based information sources is non-trivial because the user has only indirect access to the data: one cannot browse the whole target database, but rather only the tuples that satisfy her queries. In this scenario, a common problem for the casual user is coping with failing queries, which do not return any answer. Manually relaxing failing queries is a frustrating, tedious, time-consuming task because, in the worst case, one must consider an exponential number of possible relaxations (i.e., various combinations of values for each possible subset of attributes); furthermore, these difficulties are compounded by the fact that over-relaxing the query (i.e., weakening the constraints so that many tuples satisfy them) may lead to prohibitive costs in terms of bandwidth or fees to be paid per returned tuple. Researchers have proposed automated approaches to query relaxation [12, 9, 7], but the existing algorithms share a major limitation: the knowledge used in the query relaxation process is acquired offline, independently of the failing query to be relaxed.

We introduce here an online, query-guided algorithm for relaxing failing queries that are in disjunctive normal form. Our novel algorithm, loqr,¹ uses a small, randomly-chosen subset D of the target database² to discover implicit relationships among the various domain attributes; it then uses this extracted domain knowledge to relax the failing query.

Given the small dataset D and a failing query, loqr proceeds as follows. In the first step, it uses D to learn decision rules that predict the value of each attribute from those of the other ones.

¹ loqr stands for Learning for Online Query Relaxation.
² As we will show in the empirical validation, as long as the examples in D are representative of those in the target database TD, D and TD can in fact be disjoint.
This learning step is online and query-guided: for each attribute Attr in the failing query, loqr uses the other attributes' values to predict whether the value of Attr satisfies the query's constraints on it. In a second step, loqr uses nearest-neighbor techniques to find the learned rule that is the most similar to the failing query. Finally, loqr relaxes the query's constraints by using the attribute values from this "most similar" rule.

For example, consider an illustrative laptop purchasing scenario in which the query

Q: Price ≤ $2,000 ∧ Display ≥ 17″ ∧ Weight ≤ 3 lbs
fails because large-screen laptops weigh more than three pounds. In the first step, loqr uses a small dataset of laptop configurations to learn decision rules such as

R: Price ≤ $2,300 ∧ Cpu < 2.1 GHz ∧ Display ≤ 13″ ⇒ Weight ≤ 3 lbs

The right-hand side (i.e., the consequent) of this rule consists of a constraint from the failing query, while the left-hand side specifies the conditions under which this constraint is satisfied in the dataset D. After learning such rules for each constraint in the query, loqr uses nearest-neighbor techniques to find the learned rule that is the most similar to the failing query (for the time being, let us assume that the rule R is this "most similar" rule).

To relax the failing query, loqr "fuses" the constraints on the attributes that appear in both Q and R. In our example, the relaxed query

QR: Price ≤ $2,300 ∧ Weight ≤ 3 lbs

is obtained as follows: R's constraint on CPU is ignored because CPU does not appear in Q. Q's constraint on the screen size is dropped because it conflicts with the one in R (Display cannot be simultaneously smaller than 13″ and larger than 17″). Finally, Q's price limit is increased to the value in R, which is less constraining than the original amount of $2,000. Note that this price increase is crucial for ensuring that QR does not fail: as long as the constraints in QR, which are a subset of R's, are at most as tight as those in R, QR is guaranteed to match at least the tuples covered by R (i.e., the examples from which R was learned).

In summary, the main contribution of this paper is a novel, data-driven approach to query relaxation. Our online, query-guided algorithm uses a small dataset of domain examples to extract implicit domain knowledge that is then used to relax the failing query. The rest of the paper is organized as follows: after discussing the related work, we present a detailed illustrative example. Then we describe the loqr algorithm and discuss its empirical validation.
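To make the "fuse" operation concrete, the following minimal Python sketch reproduces the merge just described. The dictionary encoding of conjunctive queries and the function name fuse are our own illustration, not loqr's actual implementation; opposite-direction constraints are simply dropped here, which matches this example (Section 4.4 refines that case into an interval intersection).

```python
# Minimal sketch of the constraint "fusion" described above. A
# conjunctive query is modeled as {attribute: (operator, value)};
# this encoding and the helper name are illustrative, not loqr's.

def fuse(query, rule_body):
    """Merge a failing query with the body of its most-similar rule."""
    relaxed = {}
    for attr, (op, val) in query.items():
        if attr not in rule_body:
            continue                 # attribute missing from the rule: drop
        r_op, r_val = rule_body[attr]
        if op != r_op:
            continue                 # conflicting directions (e.g., Display):
                                     # drop; Section 4.4 refines this case
        if op == "<=":
            relaxed[attr] = (op, max(val, r_val))  # keep the looser bound
        else:                                       # ">="
            relaxed[attr] = (op, min(val, r_val))
    return relaxed

Q = {"Price": ("<=", 2000), "Display": (">=", 17), "Weight": ("<=", 3.0)}
R = {"Price": ("<=", 2300), "Cpu": ("<", 2.1),
     "Display": ("<=", 13), "Weight": ("<=", 3.0)}
print(fuse(Q, R))   # {'Price': ('<=', 2300), 'Weight': ('<=', 3.0)}
```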
2. RELATED WORK

Query modification has long been studied in both the database and information retrieval fields [15, 18, 10, 13, 17, 4, 19, 2, 5, 3]. More recently, with the advent of XML, researchers have proposed approaches for approximate pattern matching [25], answer ranking [11], and XML-oriented query relaxation [1, 16, 20].

co-op [18] is the first system to address the problem of empty answers (i.e., failing queries). co-op is based on the idea of identifying the failing query's erroneous presuppositions, which are best introduced by example. Consider the query "get all people who make less than $15K and work in the R&D department at the Audio, Inc. company". This query may fail for two different reasons: either nobody in Audio, Inc.'s R&D department makes less than $15K, or the company does not have an R&D department. The former is a genuine null answer, while the latter is a fake empty answer that is due to the erroneous presupposition that the company has an R&D department. co-op, which focuses on finding the erroneous presuppositions, transforms the original query into an intermediate, graph-oriented language in which the connected sub-graphs represent the query's presuppositions; if the original query fails, then co-op tests each presupposition against the database by converting the subgraphs into sub-queries.

The flex system [23] can be seen as generalizing the ideas in co-op. flex reaches a high tolerance to incorrect queries by iteratively interpreting the query at lower levels of correctness. flex is also cooperative in the sense that, for any failing query, it provides either an explanation for the failure or some assistance for turning the query into a non-failing one. Given a failing query, flex generates a set of more general queries that allow it to determine whether the query's failure is genuine (in which case it suggests similar, but non-failing queries) or fake (in which case it detects the user's erroneous presuppositions).

The main drawback of systems such as flex is their high computational cost, which comes from computing and testing a large number of presuppositions (to identify the significant presuppositions, a large number of queries must be evaluated on the entire database). In order to keep the running time acceptable, Motro [22] suggests heuristics for constraining the search. A related approach is proposed by Gaasterland [12], who controls the query relaxation process by using heuristics based on semantic query-optimization techniques. Finally, on the theoretical side, Godfrey [14] proves complexity results for finding all/some minimal failing and maximal succeeding sub-queries: finding all of them is NP-hard, while finding a subset takes polynomial time.

The CoBase system [8, 6, 9, 7] is the closest approach to our loqr algorithm. Central to the CoBase approach is the concept of Type Abstraction Hierarchies (tahs), which synthesize the database schema and tuples into a compact, abstract form. In order to relax a failing query, CoBase uses three types of tah-based operators: generalization, specialization, and association; these operations correspond to moving up, down, or between the hierarchies, respectively. CoBase automatically generates the tahs by clustering all the tuples in the database [7, 21].

CoBase is similar to loqr in the sense that it uses machine learning techniques for relaxing the queries. However, the two approaches are radically different: in CoBase, the cpu-intensive clustering is performed only once, in an offline manner, on all the tuples in the database. CoBase then uses the resulting tahs to relax all failing queries. In contrast, loqr customizes the learned decision rules to each failing query by applying the c4.5 learner [24] online to a small, randomly-chosen subset of the database.

3. THE INTUITION

Let us consider again the illustrative laptop purchasing domain, in which the query

Q0: Price ≤ $2,000 ∧ CPU ≥ 2.5 GHz ∧ Display ≥ 17″ ∧ Weight ≤ 3 lbs ∧ HDD ≥ 60 GB

fails for two independent reasons:

- laptops that have large screens (i.e., Display ≥ 17″) weigh more than three pounds;
- fast laptops with large hard disks (CPU ≥ 2.5 GHz ∧ HDD ≥ 60 GB) cost more than $2,000.

In order to relax Q0, loqr proceeds in three steps: first, it learns decision rules that express the implicit relationships among the various domain attributes; then it uses nearest-neighbor techniques to identify the learned rule that is most similar to the failing query; finally, it uses the attribute values from this most-similar learned rule to relax the constraints in the failing query.
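As a working definition for what follows: a conjunction fails on a dataset when no tuple satisfies all of its constraints. A minimal sketch, assuming the same illustrative {attribute: (operator, value)} encoding as before:

```python
# Sketch: a conjunctive query fails on a dataset when no tuple
# satisfies every one of its constraints. The dictionary encoding
# of queries and examples is illustrative, not loqr's.
OPS = {"<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b,
       "<":  lambda a, b: a < b,  ">":  lambda a, b: a > b}

def satisfies(example, conjunction):
    return all(OPS[op](example[attr], v)
               for attr, (op, v) in conjunction.items())

def fails(conjunction, dataset):
    return not any(satisfies(ex, conjunction) for ex in dataset)

print(fails({"Price": ("<=", 2000)}, [{"Price": 2500}]))  # True
```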
Brand   Price   CPU       HDD     Weight    Screen
Sony    $2899   3.0 GHz   40 GB   3.9 lbs   18″
Dell    $1999   1.6 GHz   80 GB   3.6 lbs   12″
...     ...     ...       ...     ...       ...

Table 1: The original dataset D.

Brand   Price   CPU   HDD     Weight    Screen
Sony    $2899   YES   40 GB   3.9 lbs   18″
Dell    $1999   NO    80 GB   3.6 lbs   12″
...     ...     ...   ...     ...       ...

Table 2: The newly created dataset D1.

3.1 Step 1: Extracting domain knowledge

In this step, loqr uses the small, randomly-chosen subset D of the target database to discover knowledge that can be used for query relaxation. loqr begins by considering Q0's constraints independently of each other: for each constraint in Q0 (e.g., CPU ≥ 2.5 GHz), loqr uses D to find patterns that predict whether this constraint is satisfied. Intuitively, this corresponds to finding the "typical values" of the other attributes in the examples in which CPU ≥ 2.5 GHz.

For example, consider the dataset D that consists of the laptop configurations from Table 1. In order to predict, for any laptop configuration, whether CPU ≥ 2.5 GHz is satisfied, loqr uses D to create the additional dataset D1 shown in Table 2. Note that each example in D is duplicated in D1, except for the value of its CPU attribute. The original CPU attribute is replaced by a binary one (with values YES or NO) that indicates whether CPU ≥ 2.5 GHz is satisfied by the original example from D. The new, binary CPU attribute is designated as the class attribute of D1.

loqr extracts the potentially useful domain knowledge by applying the c4.5 learner to the dataset D1, thus learning a set of decision rules such as

R1: Price ≤ $2,900 ∧ Display ≥ 18″ ∧ Weight ≤ 4 lbs ⇒ IsSatisfied(CPU ≥ 2.5 GHz) == YES
R2: Price ≥ $3,500 ⇒ IsSatisfied(CPU ≥ 2.5 GHz) == YES
R3: Price ≤ $2,000 ∧ HDD ≥ 60 ∧ Weight ≤ 4 lbs ⇒ IsSatisfied(CPU ≥ 2.5 GHz) == NO

Such rules can be used for query relaxation because they describe sufficient conditions for satisfying a particular constraint from the failing query. In our example, the rules above exploit the values of the other domain attributes to predict whether CPU ≥ 2.5 GHz is satisfied.

Besides D1, loqr also creates four other datasets, D2-D5, which correspond to the constraints imposed by Q0 on the other domain attributes (i.e., Price, HDD, Weight, and Display). Each of these additional datasets is used to learn decision rules that predict whether Q0's constraints on the corresponding attributes are satisfied.

Note that this learning process takes place online, for each individual query; furthermore, the process is also query-guided in the sense that each of the datasets D1-D5 is created at runtime by using the failing query's constraints. This online, query-guided nature of the process is the key feature that distinguishes loqr from existing approaches.
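The dataset transformation of Step 1 is mechanical, as the sketch below shows for the constraint CPU ≥ 2.5 GHz. We use scikit-learn's DecisionTreeClassifier purely as a convenient stand-in for the c4.5-rules learner that loqr actually uses, and the two extra rows of D are invented here to give the tree something to split on.

```python
# Sketch of Step 1: turn D into the binary dataset D1 for the
# constraint "CPU >= 2.5 GHz", then learn a tree whose paths play
# the role of the decision rules. scikit-learn's tree is only a
# stand-in for c4.5-rules, which the paper actually uses.
from sklearn.tree import DecisionTreeClassifier, export_text

# D: (price, cpu_ghz, hdd_gb, weight_lbs, screen_in), as in Table 1.
D = [
    (2899, 3.0, 40, 3.9, 18),
    (1999, 1.6, 80, 3.6, 12),
    (2399, 2.8, 60, 4.2, 15),   # extra rows: illustrative only
    (1499, 1.4, 40, 3.1, 12),
]

# D1: copy of D in which the CPU value is replaced by the class
# attribute "does it satisfy CPU >= 2.5?" (YES/NO).
X = [(p, h, w, s) for (p, _, h, w, s) in D]
y = ["YES" if cpu >= 2.5 else "NO" for (_, cpu, _, _, _) in D]

learner = DecisionTreeClassifier().fit(X, y)
print(export_text(learner,
                  feature_names=["Price", "HDD", "Weight", "Screen"]))
```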
3.2 Step 2: Finding the "most useful" rule

At this point, we must emphasize that the rules R1, R2, and R3 learned above can be seen as existentially-quantified statements. For example, R1 can be interpreted as the statement "there are some examples in D that satisfy the condition

Q1: Price ≤ $2,900 ∧ Display ≥ 18″ ∧ Weight ≤ 4 lbs ∧ CPU ≥ 2.5 GHz"

Consequently, if we apply Q1 to D, Q1 is guaranteed not to fail because it certainly matches the examples covered by R1 (i.e., the ones from which R1 was learned). Furthermore, as D is a subset of the target database, it also follows that Q1 is guaranteed not to fail on the target database.

In this second step, loqr converts all the learned rules into the corresponding existential statements. Then it identifies the existential statement that is the "most useful" for relaxing the failing query (i.e., the one that is the most similar to Q0). This "most similar" statement is found by nearest-neighbor techniques. For example, the statement Q1 above is more similar to Q0 than

Q2: Price ≤ $3,000 ∧ Display ≥ 18″ ∧ Weight ≤ 4 lbs ∧ CPU ≥ 2.5 GHz

because Q1 and Q2 differ only in their constraint on Price, and Q0's Price ≤ $2,000 is more similar (i.e., closer in value) to Q1's Price ≤ $2,900 than to Q2's Price ≤ $3,000. Likewise, Q1 is more similar to Q0 than

Q3: Brand == "Sony" ∧ CPU ≥ 2.5 GHz

which shares only the CPU constraint with the failing query.
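To see how "closer in value" is quantified, one can plug these constraints into the distance function defined later in Section 4.3. Assuming, purely for illustration, that the prices in D range from MinPrice = $500 to MaxPrice = $3,500 (these bounds are our assumption, not taken from the paper), the Price terms are

Dist(Q0, Q1, Price) = |2000 − 2900| / (3500 − 500) = 0.30
Dist(Q0, Q2, Price) = |2000 − 3000| / (3500 − 500) ≈ 0.33

Since Q1 and Q2 impose identical constraints on Display, Weight, and CPU, every other attribute-wise term is the same, and the smaller Price term makes Q1 the nearer neighbor of Q0.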
3.3 Step 3: Relaxing the failing query

For convenience, let us assume that, of all the statements learned from the datasets D1-D5, Q1 is the one most similar to Q0. Then loqr creates a relaxed query Qr that contains only constraints on attributes that appear in both Q0 and Q1; for each of these constraints, Qr uses the less constraining of the values in Q0 and Q1. In our example, the resulting relaxed query is

Qr: Price ≤ $2,900 ∧ CPU ≥ 2.5 GHz ∧ Display ≥ 17″ ∧ Weight ≤ 4 lbs

which is obtained by dropping the original constraint on the hard disk (since it appears only in Q0), keeping the constraint on CPU unchanged (since Q0 and Q1 have identical constraints on CPU), and setting the values in the constraints on Price, Display, and Weight to the least constraining ones (i.e., the values from Q1, Q0, and Q1, respectively).

The approach above has two advantages. First, as Q1 is the statement most similar to Q0, loqr makes minimal changes to the original failing query. Second, as the constraints in Qr are a subset of those in Q1, and they are at most as tight as those in Q1 (some of them may use the looser values from Q0), it follows that all examples that satisfy Q1 also satisfy Qr. In turn, this implies that Qr is guaranteed not to fail on the target dataset, because Q1 satisfies at least the examples covered by R1.³

Let us now briefly consider a more complex example of query relaxation. Suppose that, instead of Q1, the existential statement that is the most similar to Q0 is

Q4: Price ≥ $2,500 ∧ Display ≥ 18″ ∧ Weight ≥ 2.5 lbs ∧ CPU ≥ 2.5 GHz

Note that for both Price and Weight, the constraints in Q0 and Q4 use "opposite" operators (i.e., ≤ in Q0 and ≥ in Q4, or vice versa). When relaxing Q0 based on the constraints in Q4, loqr creates the relaxed query

Qr′: CPU ≥ 2.5 GHz ∧ Display ≥ 17″ ∧ Weight ∈ [2.5, 3]

in which

- Weight is constrained to the values that are common to both Q0 and Q4 (i.e., Weight ≥ 2.5 lbs and Weight ≤ 3 lbs imply Weight ∈ [2.5, 3]);
- the constraint on Price is dropped because no value of this attribute can simultaneously satisfy the corresponding constraints from Q0 and Q4 (i.e., Price ≤ $2,000 and Price ≥ $2,500);
- the other constraints are obtained exactly as they were computed for Qr (a sketch of the continuous-attribute merge follows below).

³ Note that this guarantee holds only if D is a subset of the target database. If the two datasets are disjoint (see our experimental setup), QR may fail, even though it is highly unlikely to do so.
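Both continuous-attribute cases above (same-direction bounds and opposite-direction bounds) reduce to simple interval arithmetic. The sketch below encodes a constraint as an (operator, value) pair of our own devising; it illustrates the rule stated in the bullets and in Section 4.4, and is not loqr's code.

```python
# Sketch of the continuous-attribute merge illustrated above.
# Constraints are ("<=", v) or (">=", v); the tuple encoding is
# ours. Returns a constraint, an interval, or None (drop).

def relax_continuous(cq, cs):
    """Merge the query's and the statement's constraint on one
    continuous attribute (Section 3.3 / Figure 3)."""
    (op_q, vq), (op_s, vs) = cq, cs
    if op_q == op_s:                     # same direction: keep the
        if op_q == "<=":                 # least constraining bound
            return (op_q, max(vq, vs))
        return (op_q, min(vq, vs))
    lo = vq if op_q == ">=" else vs      # opposite directions:
    hi = vq if op_q == "<=" else vs      # intersect the two half-lines
    return ("in", (lo, hi)) if lo <= hi else None

print(relax_continuous(("<=", 2000), ("<=", 2900)))  # ('<=', 2900)
print(relax_continuous(("<=", 3.0), (">=", 2.5)))    # ('in', (2.5, 3.0))
print(relax_continuous(("<=", 2000), (">=", 2500)))  # None: dropped
```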
3.4 Beyond the illustrative example

At this point we must make two important comments. First, the failing query Q0 is just a conjunction of several constraints on the various attributes. In order to relax queries that are in disjunctive normal form (i.e., disjunctions of conjunctions similar to Q0-Q4), loqr simply considers the conjunctions independently of each other and applies the three steps above to each individual conjunction.

Second, the query-guided learning process above creates a dataset Di for each attribute that is constrained in the failing query (e.g., the five datasets D1-D5 for the query Q0). This idea generalizes in a straightforward manner to the scenario in which the user is allowed to specify "hard constraints" that should not be relaxed under any circumstances: for each example in D, the entire set of hard constraints is replaced by a single binary attribute that specifies whether or not all the hard constraints are simultaneously satisfied in that particular example. This is an additional benefit of our online, query-guided approach to query relaxation, and it is not shared by other approaches.

4. THE LOQR ALGORITHM

In this section we present a formal description of the loqr algorithm. We begin by briefly describing the syntax of the input queries, after which we discuss the algorithm itself.

4.1 The query syntax

loqr takes as input queries in disjunctive normal form (dnf). Consequently, a failing query Q0 consists of a disjunction of conjunctions of the form Q0 = C1 ∨ C2 ∨ ... ∨ Cn. In turn, each Ck is a conjunction of constraints imposed on (a subset of) the domain attributes:

Ck = Constr(Ai1) ∧ Constr(Ai2) ∧ ... ∧ Constr(Aik)

When the context is ambiguous, the notation ConstrCk(Aj) is used to denote the constraint imposed by the conjunction Ck on the attribute Aj. Each constraint consists of a domain attribute, an operator, and one or several constants. For discrete attributes, loqr accepts constraints of the type =, ≠, ∈, or ∉ (e.g., Color = black or Manufacturer ∈ {Sony, HP}). For continuous attributes, the constraints use the inequality operators ≤, <, ≥, or > (e.g., Price < 2000).

For a dnf query to fail, each of its conjunctions Ck must fail (i.e., the query consists of failing conjunctions only); conversely, by successfully relaxing any of its failing conjunctions, one turns a failing query into a non-failing one. Based on this observation, loqr successively relaxes the failing conjunctions independently of each other (see Figure 1); consequently, without any loss of generality, we focus here on the three steps used to relax a failing conjunction Ck: extracting the implicit domain knowledge (expressed as learned decision rules), finding the decision rule that is most similar to Ck, and using this decision rule to actually relax Ck.

Given:
  - a failing query Q in disjunctive normal form
  - a small, randomly-chosen subset D of the target database

RelaxedQuery = ∅
FOR EACH of Q's failing conjunctions Ck DO
  - Step 1: Rules = ExtractDomainKnowledge(Ck, D)
  - Step 2: Refiner = FindMostSimilar(Ck, Rules)
  - Step 3: RelaxedConjunction = Refine(Ck, Refiner)
  - add RelaxedConjunction to RelaxedQuery

Figure 1: loqr successively relaxes each conjunction independently of the other ones.
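Figure 1's top-level loop translates almost verbatim into code. In the skeleton below, the three step functions are placeholders for the procedures of Sections 4.2-4.4, and the representation of a DNF query as a list of conjunctions is our own assumption.

```python
# Skeleton of the top-level loop from Figure 1. The three helpers
# are placeholders for the procedures of Sections 4.2-4.4; a DNF
# query is modeled as a list of conjunctions (our encoding).

def loqr(failing_query, D,
         extract_domain_knowledge, find_most_similar, refine):
    relaxed_query = []
    for ck in failing_query:                       # each failing conjunction
        rules = extract_domain_knowledge(ck, D)    # Step 1 (Section 4.2)
        refiner = find_most_similar(ck, rules)     # Step 2 (Section 4.3)
        relaxed_query.append(refine(ck, refiner))  # Step 3 (Section 4.4)
    return relaxed_query
```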
4.2 Step 1: Extracting domain knowledge

In this first step, loqr uses a subset of the target database to uncover the implicit relationships that hold among the domain attributes. This is done by the following strategy: for each attribute Aj that appears in the failing conjunction Ck, loqr uses the values of the other domain attributes to predict whether Aj's value satisfies ConstrCk(Aj).

As shown in Figure 2, loqr starts with a randomly-chosen subset D of the target database and creates one additional dataset Dj for each attribute Aj that is constrained in Ck. Each dataset Dj is a copy of D that differs from the original only in the values of Aj: for each example in Dj, if the original value of Aj satisfies ConstrCk(Aj), loqr sets Aj to yes; otherwise Aj is set to no. For each Dj, the binary attribute Aj is designated as the class attribute.

After creating these additional datasets, loqr applies c4.5-rules [24] to each of them, thus learning decision rules that, for each attribute Aj in Ck, predict whether Aj satisfies ConstrCk(Aj). In other words, these learned decision rules represent patterns that use the values of some domain attributes to predict whether a particular constraint in Ck is satisfied.

ExtractDomainKnowledge(conjunction Ck, dataset D)
  - Rules = ∅
  - FOR EACH attribute Aj that appears in Ck DO
      - create the following binary classification dataset Dj:
          FOR EACH example ex ∈ D DO
            - make a copy ex′ of ex
            - IF ex′.Aj satisfies ConstrCk(Aj)
                THEN set ex′.Aj to "yes"
                ELSE set ex′.Aj to "no"
            - add ex′ to Dj
      - designate Aj as the (binary) class attribute of Dj
      - apply the c4.5-rules algorithm to Dj
      - add these learned rules to Rules
  - return Rules

Figure 2: Step 1: query-guided extraction of domain knowledge expressed as decision rules.

4.3 Step 2: Finding the "refiner statement"

After the learning step above, loqr converts each learned decision rule into the equivalent existentially-quantified statement. This is done by simply replacing "⇒" with "∧" in each rule (remember that any decision rule "a ∧ b ∧ c ⇒ d" can be interpreted as the existential statement "there are examples in D such that a ∧ b ∧ c ∧ d").

Note that the resulting existential statements have the same syntax as the failing conjunctions; i.e., both represent a conjunction of constraints on the domain attributes. Consequently, loqr can detect the existential statement that is the most similar to Ck by performing an attribute-wise comparison between the constraints in Ck and those in each of the learned statements (i.e., loqr finds Ck's "nearest neighbor" among the existential statements).

In order to evaluate the similarity between a conjunction Ck and a statement S, loqr uses the function

dist(Ck, S) = Σ wj × Dist(Ck, S, Aj)

where the sum ranges over every attribute Aj that appears in Ck or in S, and wj denotes the user-provided (relative) weight of the attribute Aj. Dist(Ck, S, Aj) is a measure of the similarity between the constraints imposed on Aj by Ck and S; it has range [0, 1] (the smaller the value, the more similar the constraints) and is defined as follows:

- If the attribute Aj does not appear in both Ck and S, then Dist(Ck, S, Aj) = 1; i.e., Ck and S are highly dissimilar with respect to Aj.

- If Aj takes discrete values, then Dist(Ck, S, Aj) = 0 if Values(ConstrCk(Aj)) ∩ Values(ConstrS(Aj)) ≠ ∅, and 1 otherwise. In other words, Ck and S are highly similar with respect to Aj if there is at least one value of Aj that is valid for both ConstrCk(Aj) and ConstrS(Aj).

- If Aj takes continuous values, then

  Dist(Ck, S, Aj) = |Value(ConstrCk(Aj)) − Value(ConstrS(Aj))| / (MaxAj − MinAj)

  where Value() returns a constraint's numeric value, while MaxAj and MinAj are the largest and smallest values of Aj in D, respectively. Intuitively, the smaller the relative difference between the values in the two constraints, the more similar the constraints.
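For continuous attributes, dist is thus a weighted, range-normalized sum of per-attribute differences, as in the sketch below; the constraint encoding and parameter names are ours, and the discrete case would substitute the 0/1 overlap test described above.

```python
# Sketch of dist() from Section 4.3 for continuous attributes.
# Constraints are (op, value) pairs; `spans` maps each attribute
# to (Min_Aj, Max_Aj) computed over D. For discrete attributes,
# the 0/1 value-overlap test would be used instead. The encoding
# and parameter names are ours.

def dist(ck, s, spans, weights=None):
    weights = weights or {}
    total = 0.0
    for aj in set(ck) | set(s):
        w = weights.get(aj, 1.0)          # user-provided weight, default 1
        if aj not in ck or aj not in s:
            total += w * 1.0              # attribute on one side only
        else:
            lo, hi = spans[aj]
            total += w * abs(ck[aj][1] - s[aj][1]) / (hi - lo)
    return total
```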
4.4 Step 3: Refining the failing conjunction

After finding the statement S that is the most similar to the failing conjunction Ck, loqr uses the domain knowledge synthesized by S to relax the constraints in Ck. As shown in Figure 3, the relaxation works as follows:

- The relaxed conjunction CRelax includes only constraints on attributes that are present in both Ck and S.

- If Aj is discrete, CRelax constrains its values to the ones common to both ConstrCk(Aj) and ConstrS(Aj). If Values(ConstrCk(Aj)) ∩ Values(ConstrS(Aj)) = ∅, then CRelax does not impose any constraint on Aj.

- If Aj is continuous, there are two possible scenarios:

  1. If ConstrCk(Aj) and ConstrS(Aj) use the same type of inequality (e.g., both use > or ≥), then CRelax contains the least constraining⁴ of ConstrCk(Aj) and ConstrS(Aj). For example, if the constraints are Price < $1,000 and Price ≤ $799, then CRelax contains the former.

  2. If ConstrCk(Aj) and ConstrS(Aj) use different types of inequalities (e.g., one of them uses > or ≥, while the other one uses < or ≤), then CRelax contains the intersection of the two constraints. For example, if the constraints are Price < $1,000 and Price ≥ $799, then CRelax contains Price ∈ [799, 1000); if the constraints are Price ≥ $1,000 and Price < $799, their intersection is empty, which means that CRelax imposes no constraint on Price.

Refine(conjunction Ck, statement S)
  - CRelax = ∅
  - FOR EACH attribute Aj that appears in both Ck and S DO
      IF Aj is discrete THEN
        - CRelax = CRelax ∧ (ConstrCk(Aj) ∩ ConstrS(Aj))
      ELSE /* Aj is continuous */
        IF ConstrCk(Aj) and ConstrS(Aj) are of the "same type" THEN
          - CRelax = CRelax ∧ Least(ConstrCk(Aj), ConstrS(Aj))
        ELSE
          - CRelax = CRelax ∧ (ConstrCk(Aj) ∩ ConstrS(Aj))
  - return CRelax

Figure 3: Step 3: relaxing a failing conjunction.

⁴ For each continuous attribute, loqr requires the user to specify whether larger or smaller values are more desirable. In our laptop scenario, the larger the hard disk, the better; conversely, the smaller the Price, the better.
5. EXPERIMENTAL RESULTS

In this section we begin with a brief overview of the five algorithms to be evaluated, followed by a description of the datasets, the experimental setup, and the actual results.

5.1 The Algorithms

We empirically compare the performance of loqr with that of the following four algorithms: loqr-50, loqr-90, s-nn, and r-nn. The first two are variants of loqr, while the other two represent baselines.

The baselines work as follows: for each failing conjunction Ck, they use the distance measure from Section 4.3 to find the example Ex ∈ D that is the most similar to Ck. Then they use Ex to create a conjunction Ck′ that has the same constraints as Ck, except that the original attribute values are replaced by the corresponding values from Ex. The difference between s-nn and r-nn is that the former simply returns Ck′ as the relaxed query, while the latter uses Ck′ to relax Ck as explained in Section 4.4.

loqr-50 and loqr-90 are variants of loqr that illustrate the trade-off between the following strategies: "generate over-relaxed queries that are highly unlikely to fail, but return a (relatively) large number of tuples" vs. "create under-relaxed queries that return fewer tuples, but are more likely to fail". Both algorithms assume that the user designates one of the attributes as the most relevant (ideally, the constraint on this attribute should not be relaxed at all). For each failing conjunction Ck, loqr-50 and loqr-90 run loqr, which computes the relaxed conjunction CRelax. After the user selects an attribute Aj from CRelax as the most relevant, the algorithms

- apply the relaxed query QR to the dataset D and obtain the set CT of compliant tuples, which consists of the examples covered by the decision rule used to relax the failing query;

- determine the set V of all values taken by the attribute Aj over the dataset CT;

- replace the value in ConstrCk(Aj) by "the most constraining" 50- or 90-percentile value, respectively.⁵ For example, if ConstrCk(Aj) is Price < $2000, loqr-90 replaces $2000 by a value v ∈ V such that exactly 90% of the values in V are worse (i.e., smaller) than v. Similarly, if ConstrCk(Aj) is CPU > 2.5 GHz, loqr-90 replaces 2.5 GHz by a value v ∈ V such that exactly 90% of the values in V are larger than v. (A sketch of this computation appears below.)

For all five algorithms above, we used equal weights (wj = 1) in the formula for dist(Ck, S), which measures the similarity between two conjunctive formulas (see Section 4.3).

⁵ Obviously, this strategy applies only to continuous attributes. For discrete ones, the user is asked to select a correspondingly small subset of the most desirable values.
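The percentile computation used by loqr-50 and loqr-90 can be sketched as follows; the helper name and the index arithmetic (which hits the stated percentile only approximately on small V) are our own illustration of the step described above.

```python
# Sketch of the loqr-50/loqr-90 tightening step: choose a new
# bound v for the most relevant attribute Aj so that roughly pct%
# of the values V (taken by Aj over the compliant tuples CT) still
# satisfy the tightened constraint. Names are illustrative.

def tighten(V, op, pct=90):
    """New bound v for 'Aj op v' such that ~pct% of V satisfy it."""
    ordered = sorted(V)
    n = len(ordered)
    if op in ("<", "<="):          # e.g., Price < $2000: pct% below v
        idx = min(int(n * pct / 100), n - 1)
    else:                          # e.g., CPU > 2.5 GHz: pct% above v
        idx = max(int(n * (100 - pct) / 100) - 1, 0)
    return ordered[idx]
```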
5.2 The Datasets and the Setup

We evaluate the algorithms above on six different datasets. The first one, laptops, is the original motivation for this work. It consists of 1257 laptop configurations extracted from yahoo.com; each laptop is described by five numeric attributes: price, cpu speed, ram, hdd space, and weight. The other five domains are taken from the UC Irvine repository: breast cancer Wisconsin (bcw), low resolution spectrometer (lrs), Pima Indians diabetes (pima), water treatment plant (water), and waveform data generator (wave).

In order to evaluate the performance of the five algorithms above, we proceed as follows. Given a failing query Q and a dataset D, each algorithm uses D to generate a relaxed query QR. In order to estimate its adequacy, QR is then evaluated on a test set that consists of all examples in the target database except the ones in D. We have chosen to keep D and the test set disjoint (which may lead to the relaxed query failing on the test set) because in our motivating Web-based domains the owner of a database may be unwilling to provide the dataset D. By keeping D and the test set disjoint, we can simulate (up to a certain level) the scenario in which the relaxation algorithm exploits an alternative information source that is somewhat representative of the data in the target database.

For each of the six domains, we have seven distinct failing queries. We also consider various sizes of the dataset D: 50, 100, 150, ..., 350 examples; for each of these sizes, we create 100 arbitrary instances of D, together with the 100 corresponding test sets. For each size of D and each of the seven failing queries, each query relaxation algorithm is run 100 times (once for each instance of D); consequently, the results reported here are the average of these 700 runs.

5.3 The Results

In our experiments, we focus on two performance measures:

- robustness: what percentage of the failing queries are successfully relaxed (i.e., they no longer fail)?
- coverage: what percentage of the examples in the test set satisfy the relaxed query?

Figures 4 and 5 show the robustness and coverage results on the six evaluation domains. In terms of robustness, loqr obtains by far the best results: independently of the size of D, on all six domains loqr's robustness is above 90%; in fact, most of the robustness results are close to or above 99% (i.e., about 99% of the queries are successfully relaxed).

In contrast, the two baselines, s-nn and r-nn, display extremely poor robustness: on lrs, pima, water, and wave their robustness is below 10%. The baselines' best results are obtained on a small-sized D (i.e., Size(D) = 50), where the scarcity of the training data makes it unlikely to find a domain example that is highly similar to the failing query; in turn, this leads to an over-relaxation of the query, which improves the robustness. However, as Size(D) increases, the performance of the two baselines degrades rapidly.

The second measure of interest, coverage, must be considered in conjunction with the robustness results: even though we are interested in low-coverage results,⁶ a low-coverage, non-robust algorithm is of little practical importance. Consequently, the low-coverage results of the two baselines (see Figure 5) must be put in perspective: after all, the vast majority of the queries relaxed by these algorithms still fail.

loqr scores less spectacularly in terms of coverage: when learning from datasets of 350 examples, its coverage on the six domains is between 2% and 19%. However, by trading off robustness for coverage, loqr-90 obtains excellent overall results: when using 350 training examples, on all domains but bcw, loqr-90 reaches robustness levels between 69% and 98%, while also keeping the coverage under 5%.

⁶ Our motivation comes from Web-based information sources, for which high-coverage queries may be unacceptable because of (1) the database owner's unwillingness to return large chunks of the data; (2) the bandwidth problems associated with transmitting huge datasets over the Internet; and (3) the fee that one may have to pay for each returned tuple. Consequently, we are interested in query relaxations that return only a few, highly relevant tuples.
[Figure 4: Robustness: what percentage of the relaxed queries are not failing? Six panels (laptops, bcw, water, lrs, pima, wave) plot robustness (%) against the number of examples in D for loqr, loqr-50, loqr-90, s-nn, and r-nn.]

[Figure 5: Coverage: what percentage of the examples in the test set match the relaxed query? Six panels (laptops, bcw, water, lrs, pima, wave) plot coverage (%) against the number of examples in D for the same five algorithms.]

Last but not least, we are also interested in the amount of time spent relaxing a query: given that each query is processed online, it is crucial that loqr quickly relaxes the incoming queries. In Table 3, we show the cpu time (in seconds) spent refining the queries in the six application domains. Our results show that loqr is extremely fast: queries that consist of at most 21 attributes are processed in under 1.40 seconds; for queries that consist of 93 attributes, it may take up to 30 seconds. To put things into perspective, a human cannot even read and comprehend a 93-attribute query in this amount of time; in fact, it is highly unlikely that humans are even willing to write 93-attribute queries.

Dataset   Attributes per query   cpu seconds/query, |D| = 50   cpu seconds/query, |D| = 350
laptops        5                      0.06                          0.15
pima           8                      0.15                          0.47
bcw           10                      0.14                          0.37
wave          21                      0.46                          1.39
water         38                      0.87                          4.40
lrs           93                      4.32                         30.71

Table 3: Running times for loqr, averaged over the runs on each of the 100 instances of D that are created for 50 and 350 examples, respectively.

Note that the running time is influenced both by the size of the dataset D and by the number of attributes in the query. The former relationship is straightforward: the larger the dataset D, the longer it takes to learn the decision rules. The latter is more subtle: as loqr creates a new dataset for each attribute in the query, it follows that the more attributes in the query, the longer it takes to process the query. However, not all constraints are equally time-consuming; for example, if an attribute A is constrained to take a value that is out of the range of values encountered in D, then it is superfluous to learn from the corresponding dataset, which consists solely of "negative examples" (the constraint on A is not satisfied in any of the examples from D).

6. DISCUSSION

Before concluding this paper, we must discuss two important design choices that heavily influence loqr's performance: the online and the query-driven nature of the learning process. The former refers to the fact that the learning step is performed at run-time, for each failing query. The latter specifies that the learning task is designed as a binary classification problem in which the class attribute represents the boolean value "does Aj's value satisfy the constraints imposed on it by the failing query?"

6.1 Online vs offline learning

In order to illustrate the advantages of our online approach, we re-use the experimental setup above to compare loqr with an offline variant, off-k.
The only difference between the two algorithms is that off-k performs the learning step only once, independently of the constraints that appear in the failing queries. Similarly to loqr, off-k also tries to predict an attribute's value from those of the other attributes; however, because it does not have access to the constraints in the failing query, off-k proceeds as follows: for discrete attributes, it learns to predict each discrete value from the values of the other attributes; for continuous attributes, it discretizes the attribute's range of values in D into k intervals of equal size.

In this empirical comparison, we use two offline versions (i.e., k = 2 and k = 3) for both loqr and loqr-90. Figures 6 and 7 show the robustness results⁷ for loqr and loqr-90, respectively (off-2 and off-3 denote the two offline versions of each algorithm). The graphs show that both loqr and loqr-90 clearly outperform their offline variants,⁸ thus demonstrating the superiority of the online, query-guided approach.

⁷ Because of space limitations, we do not show the coverage results. However, they can be summarized as follows: on all six domains, the coverage of the online and offline algorithms is extremely close. On water and lrs, both loqr and loqr-90 outperform their offline counterparts, while on bcw and wave the performance is virtually the same.

⁸ Figures 6 and 7 also show that off-3 does not always outperform off-2. This is because the discretization of the continuous attributes is made independently of the values in the failing query: as the discretization takes place offline, the values that appear in the query's constraints may lie anywhere within a discretized interval. Consequently, the "purity" of the discretized class that includes the query value may vary wildly (e.g., almost all values in that interval may or may not satisfy the constraint from the query), which in turn dramatically affects the quality of the relaxed query.

6.2 Query-driven learning

As we have already seen, guiding the learning process by the constraints that appear in the failing query leads to dramatic robustness improvements. However, the query-driven approach used by loqr is by no means the only possible one. In fact, we can distinguish four main scenarios:

- no constraints: this approach corresponds to offline learning, in which none of the query constraints are used to guide the learning process.

- class-attribute constraints: this is the approach used by loqr, in which we create a dataset for each attribute in the query. Each such dataset uses exactly one of the failing query's constraints to guide the learning; more precisely, one of the constrained attributes becomes the designated class attribute, which takes two discrete values (i.e., does/does-not satisfy the constraint).

- set of hard constraints: this scenario represents a straightforward generalization of the previous one. If the user specifies a subset of M of the failing query's constraints that must be satisfied (i.e., hard constraints), one can replace the corresponding M attributes by a single discrete one that represents the conjunction of the M hard constraints, and then apply loqr (see the sketch below).

- all constraints: this final scenario corresponds to simultaneously replacing the original values of all the attributes in the query with the corresponding boolean values (i.e., does/does-not satisfy the constraints in the query).
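A possible encoding of the "set of hard constraints" scenario is sketched below, under the assumption that examples are dictionaries and that a satisfies predicate like the one sketched in Section 3 is available; the attribute name HardOK is hypothetical.

```python
# Sketch of the "set of hard constraints" encoding: the M hard
# constraints collapse into a single binary attribute recording
# whether an example satisfies all of them; loqr then treats that
# attribute like any other constrained attribute. The dictionary
# encoding and the `satisfies` predicate are illustrative.

def add_hard_constraint_attribute(D, hard_constraints, satisfies):
    """Return a copy of D with one extra boolean column, HardOK."""
    return [dict(ex, HardOK=all(satisfies(ex, c) for c in hard_constraints))
            for ex in D]
```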
In this paper we have analyzed the first two of the scenarios above. The third one represents a straightforward extension of loqr that was not addressed here because of space limitations. Finally, the last scenario has both advantages and disadvantages relative to loqr. On one hand, by simultaneously replacing the values of all attributes with the corresponding boolean values, the "all constraints" scenario creates a single dataset instead of one per attribute; in turn, this leads to a considerable gain in processing speed. On the other hand, in the "all constraints" scenario there is only one way to relax a query, namely by dropping constraints. In contrast, loqr permits both constraint dropping and constraint relaxation (i.e., replacing the original value by a less constraining one), thus providing a significantly more flexible solution to the query relaxation problem.
[Figure 6: Online vs Offline: robustness results for loqr. Six panels (laptops, bc_wisc, water, lrs, pima, wave) plot robustness (%) against the number of examples in D for the online algorithm, off-2C, and off-3C.]

[Figure 7: Online vs Offline: robustness results for loqr and loqr-90. Six panels (laptops, bc_wisc, water, lrs, pima, wave) plot robustness (%) against the number of examples in D for on-90, off-2C, and off-3C.]

7. CONCLUSIONS & FUTURE WORK

In this paper we have introduced a novel, data-driven approach to query relaxation. Our algorithm, loqr, performs online, query-driven learning on a small subset of the target database. The learned information is then used to relax the constraints in the failing query. We have shown empirically that loqr is a fast algorithm that successfully relaxes the vast majority of failing queries.

We intend to continue our work on query relaxation along several directions.
First, we plan to extend our data-driven approach by also exploiting user preferences that are learned while the system is in use. Second, we are interested in a query visualization algorithm that would allow a user to explore the trade-offs between the various possible query relaxations. Finally, we plan to integrate the query visualization module into a mixed-initiative system in which the user interacts with the query relaxation algorithm by expressing various preferences over the domain attributes.

8. ACKNOWLEDGMENTS

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under Contract No. NBCHD030010. We would like to thank Melinda Gervasio, Karen Myers, Maria Muslea, and Tomas Uribe for their helpful comments on this paper. We also thank Steven Minton and Fetch Technologies, Inc. for providing us with the Laptops dataset.

9. REFERENCES

[1] S. Amer-Yahia, S. Cho, and D. Srivastava. Tree pattern relaxation. In International Conference on Extending Database Technology (EDBT), 2002.
[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[3] K. Chakrabarti, M. Ortega, S. Mehrotra, and K. Porkaew. Evaluating refined queries in top-k retrieval systems. IEEE Transactions on Knowledge and Data Engineering, 15(5), 2003.
[4] S. Chaudhuri. Generalization and a framework for query modification. In Proceedings of the Sixth International Conference on Data Engineering, pages 138–145. IEEE Computer Society, 1990.
[5] S. Chaudhuri and L. Gravano. Evaluating top-k selection queries. In Proceedings of the Conference on Very Large Databases, pages 397–410, 1999.
[6] W. Chu, Q. Chen, and A. Huang. Query answering via cooperative data inference. Journal of Intelligent Information Systems, 3(1):57–87, 1994.
[7] W. Chu, K. Chiang, C.-C. Hsu, and H. Yau. An error-based conceptual clustering method for providing approximate query answers. Communications of the ACM, 39(12):216–230, 1996.
[8] W. Chu, R. C. Lee, and Q. Chen. Using type inference and induced rules to provide intensional answers. In Proceedings of the Seventh International Conference on Data Engineering, 1991.
[9] W. Chu, H. Yang, K. Chiang, M. Minock, G. Chow, and C. Larson. CoBase: A scalable and extensible cooperative information system. Journal of Intelligent Information Systems, 6(2/3):223–259, 1996.
[10] F. Corella, S. J. Kaplan, G. Wiederhold, and L. Yesil. Cooperative responses to boolean queries. In Proceedings of the First International Conference on Data Engineering, pages 77–85, 1984.
[11] N. Fuhr and K. Grosjohann. XIRQL: A query language for information retrieval in XML documents. In Research and Development in Information Retrieval, pages 172–180, 2001.
[12] T. Gaasterland. Cooperative answering through controlled query relaxation. IEEE Expert, 12(5):48–59, 1997.
[13] A. Gal. Cooperative responses in deductive databases. PhD thesis, Department of Computer Science, University of Maryland, College Park, 1988.
[14] P. Godfrey. Minimization in cooperative response to failing database queries. International Journal of Cooperative Information Systems, 6(2):95–149, 1997.
[15] J. M. Janas. Towards more informative user interfaces. In Proceedings of the Fifth International Conference on Very Large Data Bases, pages 17–23, 1979.
[16] Y. Kanza and Y. Sagiv. Flexible queries over semistructured data. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 40–51, 2001.
[17] M. Kao, N. Cercone, and W.-S. Luk. Providing quality responses with natural language interfaces: The null value problem. IEEE Transactions on Software Engineering, 14(7):959–984, 1988.
[18] S. Kaplan. Cooperative aspects of database interactions. Artificial Intelligence, 19(2):165–187, 1982.
[19] D. A. Keim and H. Kriegel. VisDB: Database exploration using multidimensional visualization. Computer Graphics and Applications, 1994.
[20] D. Lee. Query Relaxation for XML Model. PhD thesis, Department of Computer Science, University of California, Los Angeles, 2002.
[21] M. Merzbacher and W. Chu. Pattern-based clustering for database attribute values. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, 1993.
[22] A. Motro. SEAVE: A mechanism for verifying user presuppositions in query systems. ACM Transactions on Information Systems, 4(4):312–330, 1986.
[23] A. Motro. FLEX: A tolerant and cooperative user interface to databases. IEEE Transactions on Knowledge and Data Engineering, 2(2):231–246, 1990.
[24] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[25] A. Theobald and G. Weikum. Adding relevance to XML. Lecture Notes in Computer Science, 1997:105–131, 2001.