introduction to Crowddb

1.CrowdDB : Answering Queries With Crowdsourcing Presented by: Fareedah ALSaad 1 Human in the Loop

2.Motivation Relational database systems are based on the “Closed World Assumption ”. R elational databases are extremely literal. SELECT market_capitalization FROM company WHERE name = "I.B.M." ; Entity Resolution Problem RDMS cannot deal with subjective comparison. How to leverage human resources to extend the capabilities of database systems? 2 Human in the Loop

3.Motivation Develop CrowdDB : A relational query processing system  maintain SQL semantics . Rely on traditional RDBS to do the heavy lifting data manipulation. Extend SQL to enable queries that involve human computation. 3 Human in the Loop

4.Crowdsourcing Platform Amazon Mechanical Turk (AMT). AMT basics: HIT (Human Intelligent Task) : Smallest entity of work  one or more job, e.g. tagging 5 pictures. Assignment: A HIT is replicated into multiple assignments for majority votes. HIT Group : A group of similar HITs. 4 Human in the Loop

5.Crowdsourcing Platform Amazon Mechanical Turk (AMT). AMT APIs: createHIT (title, description, question, keywords, reward, duration, maxAssignments , lifetime)  HitID getAssignmentsForHIT ( HitID )  list ( asnId , workerId , answer) approveAssignment ( asnID ) rejectAssignment ( asnID ) forceExpireHIT ( HitID ) 5 Human in the Loop

6.CrowdDB Design Considerations Performance and Variability : People and machines differ in speed, cost, and quality. People show tremendous variability. Task Design and Ambiguity : Ambiguities due to natural language. Interface design can affect the accuracy and the speed. Affinity and Learning : C rowd workers develop relationships with requesters and skills for certain HIT types. Relatively Small Worker Pool . Open vs. Closed World . 6 Human in the Loop

7.Overview of CrowdDB 7 An application issues requests using CrowdSQL . T he complexities of dealing with the crowd are encapsulated by CrowdDB . Results obtained from the crowd can be stored in the database for future use. Human in the Loop

8.CrowdSQL – Incomplete Data Use special keyword: CROWD . Incomplete data can occur in two flavors : Crowdsourced Column: CREATE TABLE Department ( university STRING, name STRING, url CROWD STRING, phone STRING, PRIMARY KEY (university, name) ); Crowdsourced Table: CREATE CROWD TABLE Professor ( name STRING PRIMARY KEY, email STRING UNIQUE, university STRING, department STRING, FOREIGN KEY (university, department) REF Department(university, name) ); 8 Human in the Loop

9.CrowdSQL – Incomplete Data Use new value type CNULL to indicates that a value should be crowdsourced when it is first used . CNULL is the default value of any CROWD column. CNULL values are generated as a side-effect of INSERT statements: INSERT INTO Department(university, name) VALUES ("UC Berkeley", "EECS") ; 9 university name url phone UC Berkeley EECS CNULL NULL Human in the Loop

10.CrowdSQL – Incomplete Data Use new value type CNULL to indicates that a value should be crowdsourced when it is first used . CNULL is the default value of any CROWD column. CNULL values are generated as a side-effect of INSERT statements: INSERT INTO Department(university, name) VALUES ("UC Berkeley", "EECS") ; Allow crowdsourcing as a side-effect of query processing: SELECT url FROM Department WHERE name = "Math"; 10 Human in the Loop

11.CrowdSQL – Subjective Comparisons Use two new built in functions: CROWDEQUAL,CROWDORDER. CROWDEQUAL (~=) SELECT profile FROM department WHERE name ~= "CS"; CROWDORDER CREATE TABLE picture ( p IMAGE, subject STRING); SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CROWDORDER (p, " Which picture visualizes better %subject"); 11 Human in the Loop

12.CrowdSQL in Practice Practical issues that limit the usage of CrowdSQL : C ost and response time of queries can be unbounded. Lineage: track source of data to take actions. C leansing of crowdsourced data  entity resolution. Human in the Loop 12

13.User Interface Generation A utomatically generates user interfaces for incomplete information and subjective comparisons . Create templates at a compile-time. Templates are instantiated at a run-time for each tuple. Templates can be edited for customized instruction. 13 Human in the Loop

14.User Interface Generation Basic Interface: Two types of optimization: Batch several tuples. Prefetching of attributes of the same tuple. 14 Human in the Loop

15.User Interface Generation Multi- r elational interface: Foreign-key references a non- crowdsourced table: A drop-down box of possible foreign keys. Ajax-based “suggest” function. Foreign-key references a crowdsourced table: Normalized interface  suggest function can be used to avoid entity resolution problem. Denormalized interface. 15 Human in the Loop

16.Query Processing – Crowd Operators Three crowd operators: CrowdProbe : Crowdsources missing information of CROWD and new tuples. CrowdJoin : At least one table is a crowdsourced table. CrowdCompare : Implement the CROWDEQUAL and CROWDORDER function. Q uality control is carried out by a majority vote. 16 Human in the Loop

17.Query Processing – Physical Plan Generation 17 Heuristics: Simple rule-based optimizer: e.g. predicate push-down. Crowdsourcing rules: S et the basic crowdsourcing parameters (price, batching-size). Select the user interface (normalized vs. denormalized ). A cost-based optimize that considers the changing conditions on AMT, remains future work. Human in the Loop

18.Experiments and Results – Simple Queries Response Time, Vary HIT Group 18 Human in the Loop Within 30 min SELECT phone_number , address FROM businesses;

19.Experiments and Results – Simple Queries Responsiveness, Vary Rewards 19 Human in the Loop

20.Experiments and Results – Simple Queries Worker Affinity and Quality 20 Human in the Loop

21.Experiments and Results – Complex Queries Entities Resolution on Companies 21 Human in the Loop SELECT name FROM company WHERE name~=" [a non-uniform name of the company]"

22.Experiments and Results – Complex Queries Ordering Pictures 22 {# of votes by the workers, picture rank based on workers votes, picture rank ordered by experts} Human in the Loop

23.Experiments and Results – Complex Queries Joining Professors and Departments SELECT , , , FROM Professor p, Department d WHERE p.department = AND = AND = "[name of a professor]" Compare the performance of two plans: Two steps : collect professor information and then the department information. A single step : collect professor and department information together  use denormalized interface T wo plans were similar in execution time and cost. First plan (two steps) has better accuracy. In the second plan, workers submitted the professors’ phone numbers instead of the departments. 23 Human in the Loop

24.Observations C hallenges in controlling the factors that impact response time, cost and result quality. C rowd resources involve long-term memory that can impact performance  Keep workers happy. U ser interface design and precise instructions matter  a good interface improves result quality and worker efficiency. 24 Human in the Loop

25.Related Work Database Systems: L everages traditional techniques for relational query processing. Top N optimizations  to deal with the open-world nature of crowdsourcing. T he volatility of crowd performance  needs adaptive query processing techniques. A utomatic generation of user interfaces  similar to Oracle Forms. Crowdsourcing Communities: Ipeirotis analyzed the AMT marketplace by gathering some statistics. CrowdSearch attempts to automatically control quality and optimize response time. TurKit is a set of tools that enables programming iterative algorithms over the crowd. Quark and ( Parameswaran and Polyzotis , 2011): explore the use of crowdsourcing in relational query processing . 25 Human in the Loop

26.Thank You 26 Human in the Loop