# CrowdScreen: Algorithms for Filtering Data with Humans

Given a set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem.We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error.We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.

4.    p0 (x − 1, y)(1 − e0 ) + p0 (x, y − 1)e0 if ¬Term(x, y − 1) ∧ ¬Term(x − 1, y) p0 (x, y − 1)e0 if ¬Term(x, y − 1) ∧ Term(x − 1, y)  p0 (x, y) =  p0 (x − 1, y)(1 − e0 )  if Term(x, y − 1) ∧ ¬Term(x − 1, y)  0 if Term(x, y − 1) ∧ Term(x − 1, y)    p1 (x, y − 1)(1 − e1 ) + p1 (x − 1, y)e1 if ¬Term(x, y − 1) ∧ ¬Term(x − 1, y) p1 (x, y − 1)(1 − e1 ) if ¬Term(x, y − 1) ∧ Term(x − 1, y)  p1 (x, y) =   p1 (x − 1, y)e1 if Term(x, y − 1) ∧ ¬Term(x − 1, y)  0 if Term(x, y − 1) ∧ Term(x − 1, y) Figure 2: Recursive Equations for p0 and p1 In Problem 2 we ensure that the error is never above threshold, to (x, y) is the sum of two quantities: (a) the probability of the item even in very unlikely executions. This formulation may be pre- not satisfying the filter and getting to (x, y −1) and getting an extra ferred when errors are disastrous, e.g., if our filter is checking for YES, and (b) the probability of the item not satisfying the filter and patients with some disease, or for defective automobiles. However, getting to (x − 1, y) and getting an extra NO. The probability of in other cases we may be willing to tolerate uncertainty in some getting an extra YES given that the item does not satisfy the filter individual decisions (i.e., high E(x, y) at some points), in order is precisely e0 , and the probability of getting a NO is (1 − e0 ). We to reduce costs, as long as the overall error E is acceptable. For can write similar equations for p1 , as shown in the figure. instance, say we are filtering photographs that will be used by an Given values for p0 and p1 , we can use the definitions of Sec- internet shopping site. In some unlikely case, we may get 10 YES tion 2.1 to compute the errors and costs at each point, and the over- and 10 NO votes, which may mean we are not sure if the photo sat- all error and cost. Note that we do not have to compute the p values isfies the filter. In this case we may prefer to stop asking questions at all points, but only for the reachable points, as all other points to contain costs and make a decision, even though the error rate have zero error E(x, y) and cost C(x, y). at this point will be high no matter what we decide. In Section 7 we study the price one pays (number of questions) for adopting 3. DETERMINISTIC STRATEGIES the more conservative approach of Problem 2 under the same error This section develops algorithms that find effective deterministic threshold. strategies for Problem 1. Instead of minimizing the overall cost, one could minimize the maximum cost at any given point, as in the next problem: 3.1 The Paths Principle P ROBLEM 3 (M AXIMUM C OST ). Given an error threshold τ , Given a deterministic strategy, using Equations 1, 2, 3 and Fig- find a strategy that minimizes the maximum value of C(x, y) (over ure 2 given earlier, it is easy to see that the following theorem holds: all points), under the constraint E < τ . T HEOREM 3.1 (C OMPUTATION OF C OST AND E RROR ). The expected cost and error of a strategy can be computed in time pro- Another variation is to minimize error, given some constraint on portional to the number of reachable grid points. the number of questions. We specify two such variants next. In the first variant we have a maximum budget for each item we want to Based on the above theorem, we have a brute force algorithm to filter, while in the second variant we have an additional constraint find the best deterministic strategy, namely by examining strategies on the expected overall cost. corresponding to all possible assignments of Pass, Fail or Continue (i.e., continue asking questions) to each point in (0, 0) to (m, m). P ROBLEM 4 (E RROR ). Given a budget threshold per item m, (There are 3m such assignments.) Evaluating cost and error for find a strategy that minimizes E under the constraint ∀(x, y) C(x, y) < each strategy takes time O(m2 ) using the recursive equations. We m. select the one that satisfies the error threshold, and minimizes cost. We call this algorithm naive3. P ROBLEM 5 (E RROR - II). Given a budget threshold per item Note that some of these strategies are not terminating. However, m and a cost threshold α, find a strategy that minimizes E under termination can also be checked easily for each strategy considered the constraints ∀(x, y) C(x, y) < m and C < α. in time proportional to the number of reachable grid points in the strategy. (If the p0 and p1 values at all points on the line x + y = 2.4 Computing Probabilities at Grid Points m , for some m ≤ m is zero, then the strategy is terminating.) In this subsection we show how to compute the p0 and p1 values defined in the previous subsection. We focus on a deterministic T HEOREM 3.2 (B EST S TRATEGY: NAIVE 3). The naive3 al- strategy (probabilistic strategies are discussed in Section 4). gorithm finds the best strategy for Problem 1 in O(m2 3m ). We compute the p values recursively, starting at the origin. In particular, note that p0 (0, 0) is (1 − s) and p1 (0, 0) is s. (For We are able to reduce significantly the search space of the naive instance, p0 (0, 0) is the probability that the item does not satisfy algorithm by excluding the provably suboptimal strategies. Our the filter and the strategy visits the point (0, 0) — which it has to.) exclusion criterion is based on the following fundamental theorem. We can then derive the probability p0 and p1 for every other point T HEOREM 3.3 (PATHS P RINCIPLE ). Given s, e1 , e0 , for ev- in the grid as shown in Figure 2. (Recall that Term(x, y) means ery point (x, y), the function R(x, y) = p0 (x, y)/(p0 (x, y) + that (x, y) is a termination point, either Pass or Fail.) p1 (x, y)) is a function of (x, y), independent of the particular (de- To see how these equations work, let us consider the first case for terministic or probabilistic) strategy. p0 , where neither (x − 1, y) nor (x, y − 1) are termination points. Note that we can get to (x, y) in two ways, either from (x, y − 1) P ROOF. Consider a single sequence of x No answers and y Yes on getting an extra YES, or from (x − 1, y) on getting an extra NO. answers. The probability that an item satisfies the filter, and gets the Thus, the probability of an item not satisfying the filter and getting particular sequence of x No answers and y Yes answers is precisely

5. Figure 3: (a) A Shape (b) Triangular Strategy Corresponds to a Shape (c) Ladder Shape Pruning a = s × ex1 × (1 − e1 )y , while the probability that an item does not 3.2 Shapes and Ladders satisfy the filter and gets the same sequence is b = (1 − s) × ey0 × In practice, considering all 2m strategies is computationally fea- (1 − e0 )x . The choice of strategy may change the number of such sible only for very small m. In this section, we design algorithms paths to the point (x, y), however the fraction of p0 to p0 + p1 is that only consider a subset of these 2m strategies and thereby can still b/(a + b). (Note that each path precisely adds a to p1 and b to only provide an approximate solution to the problem (i.e., the ex- p0 . Thus, if there are r paths, p0 /(p0 + p1 ) = (r × b)/[r × (a + pected cost may be slightly greater than the expected cost of the b)] = b/(a + b).) It can be shown that the proof also generalizes to optimal solution). The subset of strategies that we are interested in probabilistic strategies. (In that case, r can be a fractional number are those that correspond to shapes. We will describe an efficient of paths.) way of obtaining the best strategy that corresponds to a shape. Intuitively, this theorem holds because the strategy only changes the number of paths leading to a point, but the characteristics of the Shapes: A shape is defined by a connected sequence of (horizontal point stay the same. or vertical) segments on the grid, beginning at a point on the y- Using the previous theorem, we have the result that in order to axis, and ending at a point on the x axis, along with a special point reduce the error, for every termination point (x, y), Passing or Fail- somewhere along the sequence of segments, called a decision point. ing is independent of strategy, but is based simply on R(x, y). We also assume that each segment intersects at most with two other lines, namely the ones preceding and following it in the sequence. T HEOREM 3.4 (F ILTERING I NDEPENDENT OF S TRATEGY ). As an example, consider the shape in Figure 3(a) (ignore the dashed For every optimal strategy, for every point (x, y), if Term(x, y) lines for now) This shape begins at (0, 4), has a sequence of 14 holds, then: segments, and ends at (4, 0). The decision point (not shown in the • If R(x, y) > 1/2, then Fail(x, y) figure) is (for example) at (5, 5). As seen in the figure, the segments are allowed to go in any direction (up/down or left/right). • If R(x, y) < 1/2, then Pass(x, y) Each shape corresponds to precisely one strategy, namely the one P ROOF. Given a strategy, let there be a point (x, y) for which defined as follows: Pass(x, y) holds, but R(x, y) > 1/2. Let the error be: • For each point in the sequence of segments starting at the point on the y axis, until and including the decision point, E = E0 + E(x, y)(p0 (x, y) + p1 (x, y)) we color the point blue (i.e., we designate the point as Pass). (We split the error into two parts, one dependent on other termi- In the figure, all points in the sequence of segments starting nating points, and one just dependent on (x, y).) Currently, since from (0, 4) until and including (5, 5) are colored blue. (x, y) is Pass, E(x, y) is p0 (x, y)/(p0 (x, y) + p1 (x, y)). Thus, • For each point in the sequence of segments starting at the E = E0 + p0 (x, y). If we change (x, y) to Fail, we get E = point after the decision point, until and including the point E0 + p1 (x, y), then E < E. Thus, by flipping (x, y) to Fail, on the x axis, we color the point red (i.e., we designate the we can only reduce the error. Similarly, if R(x, y) < 1/2 and point as Fail). In the figure, all points after (5, 5) on the Fail(x, y) holds, we can only reduce the error by flipping it to sequence of segments are colored red. Pass(x, y). • For all the points inside or on the shape that are reachable, Thus, we only need to consider 2m strategies, namely those where we color them green; else we color them white. (Some points for each point, we can either set it to be a continue point or a termi- colored blue or red previously may actually be unreachable nation point, and if it is a termination point, then using the previous and will be colored white in this step.) In the figure, some theorem, we can infer whether it should be Pass or Fail.1 The algo- of the blue points, such as (2, 5), (1, 5), (1, 6), . . ., (5, 5), rithm that considers all such 2m strategies is called naive2. Thus, and some of the red points (5, 4), . . ., (4, 3), and (4, 1) are we have the following theorem: colored white, while some of the unreachable points inside the shape, such as (3, 4) and (4, 4) are colored white as well. T HEOREM 3.5 (B EST S TRATEGY: NAIVE 2). The naive2 al- The reachable points inside the shape, like (1, 1) or (2, 3) are gorithm finds the best strategy for Problem 1 in O(m2 2m ). colored green. 1 Note that there is a subtle point here. When R(x, y) = 1/2 then The strategies that correspond to shapes form a large and diverse the point can either be a Pass or a Fail point, with equal effect. (It class of strategies. In particular, the triangular strategy and rect- will contribute the same amount of error either way.) For simplicity angular strategy both correspond to shapes. Consider Figure 3(b), we assume it to always return Pass if this holds. which depicts a shape corresponding to the triangular strategy of