HOT

CrowdScreen: Algorithms for Filtering Data with Humans
6 点赞
2 收藏
0下载
Given a set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem.We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error.We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.

3. Figure 1: (a) Visualization of a Triangular Strategy (b) Visualization of a Rectangular Strategy (c) Error at Terminating Points vs. Overall Error the same strategy is applied to each item in the data set, and com- points (where Term(x, y) does not hold). plete, i.e., the strategy tells us what to do at each reachable point in • C is the expected cost across all termination points, i.e.,: the grid. In addition, we want our strategies to be terminating, i.e., the C= C(x, y) × [p0 (x, y) + p1 (x, y)] (3) strategy always terminates in a finite number of steps, no matter (x,y) what sequence of YES/NO answers are received to the questions Note that to evaluate our n items using the same strategy, we (a terminating strategy effectively corresponds to a closed shape will incur an expected cost of nC. around the origin). Note that the strategies in Figures 1(a) and 1(b) are terminating. We will enforce a termination constraint in our To illustrate these metrics, consider the simple deterministic strat- problem formulations in Section 2.3. egy in Figure 1(c), and assume that s = 0.5, e1 = 0.1, e0 = 0.2. We also want our strategies to be fully determined in advance, At each termination point (x, y) we show E(x, y). For example, so that we do not need to do any computation on the fly while the E(0, 2) = 0.05. This number can be interpreted as follows: In 5% answers are received. Thus, we present algorithms that compute of the cases where we end up terminating at (0, 2), an error will be complete strategies. made. In the remaining 95% of the cases a correct (Pass) decision will be made. To compute E(0, 2), we need the p0 and p1 values. 2.2 Metrics Since there is only a single way to get to (0, 2) (two consecutive YESs) the computation is simple: p0 (0, 2) is (1 − s) (V must be To determine which strategy is best, we study two types of met- 0) times twice e0 , i.e., 0.02 (thus, p0 (0, 2) = (1 − s) × e20 ). Simi- rics, involving error and cost. We start by defining two quantities, larly, p1 (0, 2) = s × (1 − e1 )2 = 0.5 × 0.9 × 0.9 = 0.405. Thus, given a strategy: E(0, 2) = 0.02/[0.02 + 0.405] = 0.05. When there are multiple • p1 (x, y) is the probability that the strategy reaches point (x, y) ways to get to a point (e.g., for (1, 1)) the p0 , p1 computation is and the item satisfies the filter (V = 1); and more complex, and will be discussed in Section 2.4. • p0 (x, y) is the probability that the strategy reaches point (x, y) The expected error E for this sample strategy is the error weighed and the item does not satisfy the filter (V = 0). by the probability of getting to each termination point. In this case, Below we give a simple example to illustrate these quantities, and it turns out that E = 0.115. Notice the difference between the in Section 2.4 we show how to compute them in general. Further- overall error E and the individual termination point errors: The more, let us say if a Pass decision is made at (x, y) then Pass(x, y) error at (1, 1) is quite high, but because it is not very likely that holds, and that if a Fail decision is made, then Fail(x, y) holds. If we end up at (1, 1), that error does not contribute as much to the either decision is made, we say that Term(x, y) holds. overall E. We can now define the following metrics: • E(x, y) (only of interest when Term(x, y) holds) is the prob- 2.3 The Problems ability of error given that the strategy terminated at (x, y). If Given input parameters s, e1 , and e0 , the search for a “good” or Pass(x, y), then an error is made if V = 0, so E(x, y) is “optimal” strategy can be formulated in a variety of ways. Since we p0 (x, y) divided by the probability that the strategy reached want to ensure that the strategy terminates, we enforce a threshold (x, y) (i.e., divided by p0 (x, y) + p1 (x, y)). The Fail(x, y) m on the maximum number of questions we can ask for any item case is analogous, so we get: (which is nothing but a maximum budget for any item that we want  to filter). p0 (x,y)  [p0 (x,y)+p1 (x,y)] if Pass(x, y)  We start with the problem we will focus on in this paper: p1 (x,y) E(x, y) = if Fail(x, y) (1) P ROBLEM 1 (C ORE ). Given an error threshold τ and a bud-  [p0 (x,y)+p1 (x,y)]  0 else get threshold per item m, find a strategy that minimizes C under the constraint E < τ and ∀(x, y) C(x, y) < m. • E is the expected error across all termination points. That is: An alternative would be to constrain the error at each termination E= E(x, y) × [p0 (x, y) + p1 (x, y)]. (2) point: (x,y) P ROBLEM 2 (C ORE : P ER -I TEM E RROR ). Given an error thresh- • C(x, y) (only of interest when Term(x, y) holds) is the num- old τ and a budget threshold per item m, find a strategy that mini- ber of questions used to reach a decision at (x, y), i.e., simply mizes C under the constraint ∀(x, y) E(x, y) < τ and C(x, y) < x + y. We consider C(x, y) to be zero at all non-termination m.

4.    p0 (x − 1, y)(1 − e0 ) + p0 (x, y − 1)e0 if ¬Term(x, y − 1) ∧ ¬Term(x − 1, y) p0 (x, y − 1)e0 if ¬Term(x, y − 1) ∧ Term(x − 1, y)  p0 (x, y) =  p0 (x − 1, y)(1 − e0 )  if Term(x, y − 1) ∧ ¬Term(x − 1, y)  0 if Term(x, y − 1) ∧ Term(x − 1, y)    p1 (x, y − 1)(1 − e1 ) + p1 (x − 1, y)e1 if ¬Term(x, y − 1) ∧ ¬Term(x − 1, y) p1 (x, y − 1)(1 − e1 ) if ¬Term(x, y − 1) ∧ Term(x − 1, y)  p1 (x, y) =   p1 (x − 1, y)e1 if Term(x, y − 1) ∧ ¬Term(x − 1, y)  0 if Term(x, y − 1) ∧ Term(x − 1, y) Figure 2: Recursive Equations for p0 and p1 In Problem 2 we ensure that the error is never above threshold, to (x, y) is the sum of two quantities: (a) the probability of the item even in very unlikely executions. This formulation may be pre- not satisfying the filter and getting to (x, y −1) and getting an extra ferred when errors are disastrous, e.g., if our filter is checking for YES, and (b) the probability of the item not satisfying the filter and patients with some disease, or for defective automobiles. However, getting to (x − 1, y) and getting an extra NO. The probability of in other cases we may be willing to tolerate uncertainty in some getting an extra YES given that the item does not satisfy the filter individual decisions (i.e., high E(x, y) at some points), in order is precisely e0 , and the probability of getting a NO is (1 − e0 ). We to reduce costs, as long as the overall error E is acceptable. For can write similar equations for p1 , as shown in the figure. instance, say we are filtering photographs that will be used by an Given values for p0 and p1 , we can use the definitions of Sec- internet shopping site. In some unlikely case, we may get 10 YES tion 2.1 to compute the errors and costs at each point, and the over- and 10 NO votes, which may mean we are not sure if the photo sat- all error and cost. Note that we do not have to compute the p values isfies the filter. In this case we may prefer to stop asking questions at all points, but only for the reachable points, as all other points to contain costs and make a decision, even though the error rate have zero error E(x, y) and cost C(x, y). at this point will be high no matter what we decide. In Section 7 we study the price one pays (number of questions) for adopting 3. DETERMINISTIC STRATEGIES the more conservative approach of Problem 2 under the same error This section develops algorithms that find effective deterministic threshold. strategies for Problem 1. Instead of minimizing the overall cost, one could minimize the maximum cost at any given point, as in the next problem: 3.1 The Paths Principle P ROBLEM 3 (M AXIMUM C OST ). Given an error threshold τ , Given a deterministic strategy, using Equations 1, 2, 3 and Fig- find a strategy that minimizes the maximum value of C(x, y) (over ure 2 given earlier, it is easy to see that the following theorem holds: all points), under the constraint E < τ . T HEOREM 3.1 (C OMPUTATION OF C OST AND E RROR ). The expected cost and error of a strategy can be computed in time pro- Another variation is to minimize error, given some constraint on portional to the number of reachable grid points. the number of questions. We specify two such variants next. In the first variant we have a maximum budget for each item we want to Based on the above theorem, we have a brute force algorithm to filter, while in the second variant we have an additional constraint find the best deterministic strategy, namely by examining strategies on the expected overall cost. corresponding to all possible assignments of Pass, Fail or Continue (i.e., continue asking questions) to each point in (0, 0) to (m, m). P ROBLEM 4 (E RROR ). Given a budget threshold per item m, (There are 3m such assignments.) Evaluating cost and error for find a strategy that minimizes E under the constraint ∀(x, y) C(x, y) < each strategy takes time O(m2 ) using the recursive equations. We m. select the one that satisfies the error threshold, and minimizes cost. We call this algorithm naive3. P ROBLEM 5 (E RROR - II). Given a budget threshold per item Note that some of these strategies are not terminating. However, m and a cost threshold α, find a strategy that minimizes E under termination can also be checked easily for each strategy considered the constraints ∀(x, y) C(x, y) < m and C < α. in time proportional to the number of reachable grid points in the strategy. (If the p0 and p1 values at all points on the line x + y = 2.4 Computing Probabilities at Grid Points m , for some m ≤ m is zero, then the strategy is terminating.) In this subsection we show how to compute the p0 and p1 values defined in the previous subsection. We focus on a deterministic T HEOREM 3.2 (B EST S TRATEGY: NAIVE 3). The naive3 al- strategy (probabilistic strategies are discussed in Section 4). gorithm finds the best strategy for Problem 1 in O(m2 3m ). We compute the p values recursively, starting at the origin. In particular, note that p0 (0, 0) is (1 − s) and p1 (0, 0) is s. (For We are able to reduce significantly the search space of the naive instance, p0 (0, 0) is the probability that the item does not satisfy algorithm by excluding the provably suboptimal strategies. Our the filter and the strategy visits the point (0, 0) — which it has to.) exclusion criterion is based on the following fundamental theorem. We can then derive the probability p0 and p1 for every other point T HEOREM 3.3 (PATHS P RINCIPLE ). Given s, e1 , e0 , for ev- in the grid as shown in Figure 2. (Recall that Term(x, y) means ery point (x, y), the function R(x, y) = p0 (x, y)/(p0 (x, y) + that (x, y) is a termination point, either Pass or Fail.) p1 (x, y)) is a function of (x, y), independent of the particular (de- To see how these equations work, let us consider the first case for terministic or probabilistic) strategy. p0 , where neither (x − 1, y) nor (x, y − 1) are termination points. Note that we can get to (x, y) in two ways, either from (x, y − 1) P ROOF. Consider a single sequence of x No answers and y Yes on getting an extra YES, or from (x − 1, y) on getting an extra NO. answers. The probability that an item satisfies the filter, and gets the Thus, the probability of an item not satisfying the filter and getting particular sequence of x No answers and y Yes answers is precisely

5. Figure 3: (a) A Shape (b) Triangular Strategy Corresponds to a Shape (c) Ladder Shape Pruning a = s × ex1 × (1 − e1 )y , while the probability that an item does not 3.2 Shapes and Ladders satisfy the filter and gets the same sequence is b = (1 − s) × ey0 × In practice, considering all 2m strategies is computationally fea- (1 − e0 )x . The choice of strategy may change the number of such sible only for very small m. In this section, we design algorithms paths to the point (x, y), however the fraction of p0 to p0 + p1 is that only consider a subset of these 2m strategies and thereby can still b/(a + b). (Note that each path precisely adds a to p1 and b to only provide an approximate solution to the problem (i.e., the ex- p0 . Thus, if there are r paths, p0 /(p0 + p1 ) = (r × b)/[r × (a + pected cost may be slightly greater than the expected cost of the b)] = b/(a + b).) It can be shown that the proof also generalizes to optimal solution). The subset of strategies that we are interested in probabilistic strategies. (In that case, r can be a fractional number are those that correspond to shapes. We will describe an efficient of paths.) way of obtaining the best strategy that corresponds to a shape. Intuitively, this theorem holds because the strategy only changes the number of paths leading to a point, but the characteristics of the Shapes: A shape is defined by a connected sequence of (horizontal point stay the same. or vertical) segments on the grid, beginning at a point on the y- Using the previous theorem, we have the result that in order to axis, and ending at a point on the x axis, along with a special point reduce the error, for every termination point (x, y), Passing or Fail- somewhere along the sequence of segments, called a decision point. ing is independent of strategy, but is based simply on R(x, y). We also assume that each segment intersects at most with two other lines, namely the ones preceding and following it in the sequence. T HEOREM 3.4 (F ILTERING I NDEPENDENT OF S TRATEGY ). As an example, consider the shape in Figure 3(a) (ignore the dashed For every optimal strategy, for every point (x, y), if Term(x, y) lines for now) This shape begins at (0, 4), has a sequence of 14 holds, then: segments, and ends at (4, 0). The decision point (not shown in the • If R(x, y) > 1/2, then Fail(x, y) figure) is (for example) at (5, 5). As seen in the figure, the segments are allowed to go in any direction (up/down or left/right). • If R(x, y) < 1/2, then Pass(x, y) Each shape corresponds to precisely one strategy, namely the one P ROOF. Given a strategy, let there be a point (x, y) for which defined as follows: Pass(x, y) holds, but R(x, y) > 1/2. Let the error be: • For each point in the sequence of segments starting at the point on the y axis, until and including the decision point, E = E0 + E(x, y)(p0 (x, y) + p1 (x, y)) we color the point blue (i.e., we designate the point as Pass). (We split the error into two parts, one dependent on other termi- In the figure, all points in the sequence of segments starting nating points, and one just dependent on (x, y).) Currently, since from (0, 4) until and including (5, 5) are colored blue. (x, y) is Pass, E(x, y) is p0 (x, y)/(p0 (x, y) + p1 (x, y)). Thus, • For each point in the sequence of segments starting at the E = E0 + p0 (x, y). If we change (x, y) to Fail, we get E = point after the decision point, until and including the point E0 + p1 (x, y), then E < E. Thus, by flipping (x, y) to Fail, on the x axis, we color the point red (i.e., we designate the we can only reduce the error. Similarly, if R(x, y) < 1/2 and point as Fail). In the figure, all points after (5, 5) on the Fail(x, y) holds, we can only reduce the error by flipping it to sequence of segments are colored red. Pass(x, y). • For all the points inside or on the shape that are reachable, Thus, we only need to consider 2m strategies, namely those where we color them green; else we color them white. (Some points for each point, we can either set it to be a continue point or a termi- colored blue or red previously may actually be unreachable nation point, and if it is a termination point, then using the previous and will be colored white in this step.) In the figure, some theorem, we can infer whether it should be Pass or Fail.1 The algo- of the blue points, such as (2, 5), (1, 5), (1, 6), . . ., (5, 5), rithm that considers all such 2m strategies is called naive2. Thus, and some of the red points (5, 4), . . ., (4, 3), and (4, 1) are we have the following theorem: colored white, while some of the unreachable points inside the shape, such as (3, 4) and (4, 4) are colored white as well. T HEOREM 3.5 (B EST S TRATEGY: NAIVE 2). The naive2 al- The reachable points inside the shape, like (1, 1) or (2, 3) are gorithm finds the best strategy for Problem 1 in O(m2 2m ). colored green. 1 Note that there is a subtle point here. When R(x, y) = 1/2 then The strategies that correspond to shapes form a large and diverse the point can either be a Pass or a Fail point, with equal effect. (It class of strategies. In particular, the triangular strategy and rect- will contribute the same amount of error either way.) For simplicity angular strategy both correspond to shapes. Consider Figure 3(b), we assume it to always return Pass if this holds. which depicts a shape corresponding to the triangular strategy of

8. • Constraints 4, 5, 6 and 7 transform into: T HEOREM 5.2. The best strategy for Problem 4 can be found in O(m2 ). E= tP ath(x, y) × min(S0 (x, y), S1 (x, y)) (x,y);x+y≤m (We can actually compute the best strategy in O(m) if binomial C= tP ath(x, y)(x + y)(S0 (x, y) + S1 (x, y)) expressions involving m can be computed in O(1) time.) Subsequently, we can solve Problem 3 by performing repeated (x,y);x+y≤m doubling of the maximum cost m, until we find a triangular shape (Recall that S0 , S1 are constants, so the constraints are lin- for which E < τ , followed by binary search between the two ear.) In other words, E is precisely the number of paths lead- thresholds m and m/2. Thus, we have the following theorem: ing to the point that terminate at that point, times the smaller of the two error probabilities. The cost C is simply the cost T HEOREM 5.3. The best strategy for Problem 4 can be found for all paths terminating at (x, y) (each such path has proba- in O(m2 log m). bility S0 + S1 ). (We can actually compute the best strategy in O(m log m) if bino- • Constraints 8 and 9 transform into: mial expressions involving m can be computed in O(1) time.) ∀(x, y); x + y ≤ m : cP ath(x, y) + tP ath(x, y) = cP ath(x, y − 1) + cP ath(x − 1, y) 5.3 Problem 5 For Problem 5, we can use the same linear programming formu- In other words, the number of paths into (x, y) are precisely lation as in Section 2.4, except that we constrain C and minimize those that come from (x, y − 1) and (x − 1, y) E. We thus have the following: • We replace constraint 10 with the following, which implies T HEOREM 5.4. The best probabilistic strategy for Problem 5 that no paths go beyond x + y = m. can be found in O(m4 ). ∀(x, y); x + y = m : cP ath(x, y) = 0 • We also have the constraint that there is a single path to 6. MULTIPLE FILTERS (0, 0), i.e. Now, we consider the case when we have independent filters f1 , f2 , . . . , fk , with independent selectivities s1 , s2 , . . . , sk , and cP ath(0, 0) + tP ath(0, 0) = 1 independent error rates e10 , e11 , e20 , e21 , . . . , ek0 , ek1 , where ei0 is the probability that a human answers YES when the item actu- No additional constraints exist. ally does not pass the ith filter, and ei1 is the probability that the Since linear programs can be solved in time quadratic in the human answers NO when the item actually does pass the ith filter. number of variables and constraints, we have the following theo- Note that the multiple filters problem is much harder than the single rem: filter one, since a strategy not only must decide when to continue T HEOREM 4.1 (B EST P ROBABILISTIC S TRATEGY ). The best asking questions, but must also now decide which question to ask probabilistic strategy for Problem 1 can be found in O(m4 ). next (i.e., for what filter). The problem becomes even harder when We denote the algorithm corresponding to the linear program above the filters are correlated, however, we leave it for future work. as linear. We can visualize the multiple filters case as a 2k dimensional grid, where each point (x10 , x11 , x20 , x21 , . . . , xk0 , xk1 ) corre- sponds to the number of NO and YES answers received from hu- 5. OTHER FORMULATIONS mans for each of the filters. (xi0 indicates the number of NO an- 5.1 Problem 2 swers for the ith filter, and xi1 indicates the number of YES an- swers for the ith filter.) For Problem 2, we can show that we simply need to compute In its most general form, the multiple filters problem allows us, R(x, y) for every point in (0, 0) to (m, m), bottom up, and for at any point on this 2k-dimensional grid, to ask a question corre- every point where we find that min(R(x, y), 1 − R(x, y)) < τ , we sponding to any of the k filters. Thus, each point can correspond to make the point a terminating point, returning Fail if R(x, y) ≤ 0.5 “Pass”, “Fail”, or “Ask ith Filter”, where i can be one from 1 . . . k. and Fail otherwise. Using a general form of the Paths Principle (Theorem 3.3), if a In fact, we can actually terminate earlier if we find that p0 and point is a terminating point, we can infer whether it should be a p1 are 0 for all points (x, y) : x + y = m at some m < m. In this Pass or a Fail point. case, we do not need to proceed beyond points along x + y = m . We can in fact represent the problem as a linear program, gener- Note that a feasible strategy that terminates and satisfies E(x, y) < alizing the linear program in Section 4. The counterpart of the vari- τ for every terminating point may not exist. ables tP ath and cP ath are tP ath and cP ath1 , . . . , cP athk , rep- Thus, we have the following theorem: resenting respectively, the number of paths terminating at a given T HEOREM 5.1. The best strategy for Problem 2 can be found point and the number of paths continuing in the direction of each in O(m2 ). of the k filters. Similarly, the total number of paths coming into a point is simply the paths coming in from each of the k directions. 5.2 Problems 3 and 4 T HEOREM 6.1. The best probabilistic strategy for the multiple We begin by considering Problem 4 first. This problem formu- filters version of Problem 5 can be found in O(m4k ). lation is identical to Maximum A-Posteriori (MAP) estimation. In this case we simply ask all m questions (since there is no gain to This algorithm is exponential in the number of filters k, which may asking fewer than m questions). We can estimate the p0 and p1 at not be very large. However, any algorithm whose output is a strat- all points along x + y = m for this triangular strategy. If p0 > p1 , egy using our visualization would need Ω(m2k ), since we need to we return Fail at that point and Pass otherwise. We can then esti- provide a decision for each point in the 2k dimensional cube of size mate the error E over all the terminating points. m2k . It remains to be seen if we can find an optimal or approxi- Thus, we have the following theorem: mately optimal strategy whose representation is smaller.