# FlashExtract: A Framework for Data Extraction by Examples

Various document types that combine model and viewmake it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework FlashExtract to extract relevant data from semi-structured documents using examples. It includes: (a) an interaction model thatallows end-users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs. (b) an inductive synthesis algorithm to synthesize the intended program from few examples in any underlying domainspecific language for data extraction that has been built using our specified algebra of few core operators (map, filter, merge, and pair).

4. function Run (schema extraction program Q, schema M , document Fill Struct (id1 E1 , . . . , idn En ), R = new D) : schema instance is 1 CR := ∅ Struct {id1 = Fill(E1 , R), . . . , idn = Fill(En , R)} 2 foreach field f in M in top-down topological order do Fill Seq(f ), R = new 3 R˜ := Run(Q(f ), D, CR) ˜ Seq Map(λR : Fill(f, R ), Subregions(R, CR[f.Color]) 4 CR := CR ∪ {(f.Color, R) | R ∈ R} Fill([color] Val, R) = Subregion(R, CR[color]).Val 5 if CR is inconsistent with M then return ⊥ 6 else return Fill(M, D.Region) Fill([color] T, R) = Fill(T, Subregion(R, CR[color])) function Run (extraction program (f , P ) of field f , document D, Fill(_, ⊥) = ⊥ highlighting CR): f-regions is 7 R˜ := (f = ⊥)? {D.Region} : CR[f .Color] Figure 5: Semantics of Fill. 8 return P R /* execute P on R */ ˜ R ∈R D EFINITION 4 (Extraction Programs). A schema extraction pro- gram Q for a given schema M is represented as a map from each Algorithm 1: Execution semantics of extraction programs. field f in M to a field extraction program, denoted Q(f ). A field extraction program of field f is a pair (f , P ), where f We categorize ancestral relationship in order to invoke appropriate (possibly ⊥) is some ancestor field of f and P is either a SeqRegion synthesis algorithms. program that extracts a sequence of f -regions from inside a given f -region (in case f is a sequence-ancestor of f ), or is a Region Regions program that extracts a single f -region from inside a given f -region D EFINITION 2 (Region). A region R of a document is some two- (in case f is a struct-ancestor of f ). dimensional portion over the visualization layer of that document that the user is allowed to highlight in some color. We use the The execution semantics of a schema extraction program is notation f -region to denote any region that the user highlights in defined in Algorithm 1. FlashExtract executes the field extraction f.Color. Any region R that is associated with a leaf field (also program corresponding to each field f in a top-down order and referred to as a leaf region) has some value associated with it, updates the document highlighting CR using the returned list of f - which is denoted by R.Val. For a document D, we use the notation regions R˜ (lines 2–4). For each field f , it first finds the set of regions D.Region to denote the largest region possible in D. ˜ R determined by the ancestor f (line 7), and then computes all f -regions by executing the field extraction program on each region In case of text files, any region is represented by a pair of two ˜ (line 8). Once CR has been fully constructed, it generates a character positions within the file and consists of all characters in R in R between (these positions may or may not be within the same line). schema instance from the nesting relationship defined in the output The value of such a region is the string of all characters in between schema M , using the Fill function (line 6). those positions. Fig. 5 defines the semantics of Fill recursively. Each definition In case of webpages, a leaf region is represented by either an takes a schema construct and a region corresponding to one of HTML node (the value of such a region is the text value associated its ancestor fields, and returns a construct instance by recursively with that node) or a pair of character positions within the text applying Fill functions on its descendants. CR[c] returns all regions whose color is c. Subregions(R, R) ˜ returns the ordered set of content of an HTML node (the value of such a region is the string of all characters in between those positions). A non-leaf region is regions from R ˜ that are nested inside R. Subregion(R, R) ˜ returns represented by an HTML node. ˜ the region from R that is nested inside R; if no such region exists, In case of spreadsheets, a leaf region is represented by a single ⊥ is returned. Note that if CR is consistent with M , there is at most cell (and its value is the cell’s content), while a non-leaf region is one such region. We assume the presence of an API for checking represented by a pair of cells (and consists of the rectangular region the nestedness of two regions. determined by those cells). Example-based User Interaction D EFINITION 3 (Highlighting). A highlighting CR of a document D is a collection of colored regions in D. It can also be viewed as a Having defined all necessary concepts, we are now ready to discuss function that maps a color to all regions of that color in D. We say the way a user interacts with FlashExtract to extract their desired that a highlighting CR is consistent with a data scheme M if the data. The user first supplies the output data schema. Then, for each following conditions hold. field f in the schema (in an order determined by the user), the user simply provides sufficient number of examples of field instances of • For any two regions (in CR), either they don’t overlap or one is field f by highlighting appropriate regions in the document using nested inside the other. f.Color. Our user interface supports standard mouse click, drag, • For any two fields f1 and f2 in M such that f1 is an ancestor of and release gestures. f2 , each f2 -region R2 is nested inside some f1 -region R1 . When the user provides examples for a field f , FlashExtract • For any two fields f1 and f2 in M such that f1 is a struct- synthesizes a field extraction program for field f (using Algorithm 2) ancestor of f2 , there is at most one f2 -region inside a f1 -region. that is consistent with the provided examples, and executes it to • For every leaf field f in M , the value of any f -region in CR is of identify and highlight other regions in f.Color. (See Def. 5 for a type f . formal notion of consistency.) If the user is happy with the inferred highlighting, she can commit the results (the field f is said to Schema Extraction Program have been materialized at that point of time), and then proceed FlashExtract synthesizes extraction programs for individual fields to another (non-materialized) field. Otherwise, the user may provide and combines them into a schema extraction program following the any additional examples. structure imposed by the output schema. FlashExtract also leverages We say that a field f has been simplified if there exists a the schema’s structure to simplify the learning of individual fields. In materialized field f such that f is a structure-ancestor of f . The particular, it relates a field f to one of its ancestors, whose extraction examples for a non-simplified field consist of positive instances program (in case of a non-⊥ ancestor) defines learning boundaries and optionally negative instances of regions that lie completely for f (i.e., each f -region must reside inside one of these boundaries). within the regions of the immediate ancestor field that has been

5. function SynthesizeFieldExtractionProg (Document D, advantage. It allows FlashExtract to guess the organization of leaf Schema M , Highlighting CR, Field f , Regions R ˜ 1 , Regions R ˜ 2 ) is field instances by looking at their relative order (thereby obviating /* R ˜1 , R ˜ 2 denote positive, negative instances */ the need to provide examples for any non-leaf field.) While this 1 foreach ancestor field f of f in schema M do is successful in most cases, it may not be able to deal with cases 2 if f isn’t materialized ∧ f = ⊥ then continue where field instances may be null. On the other hand, iterating 3 R˜ := (f = ⊥)? {D.Region} : CR[f .Color] over fields in a top-down topological order requires the user to 4 if f is a sequence-ancestor of f then provide examples for each field (including non-leaf fields), but it 5 ex := ∅ offers three advantages: (a) it provides an easy visualization for 6 foreach R ∈ R ˜ s.t. Subregions(R, R ˜1 ∪ R ˜2 ) = ∅ the user to inspect the results of the organization of various non- do leaf field instances, (b) it provides greater chance of success since 7 ex := ex ∪ the synthesis task for a field can now be enabled relative to any {(R, Subregions(R, R ˜ 1 ), Subregions(R, R ˜ 2 ))} ancestor field as opposed to the entire document, (c) it may also entail having to provide fewer examples for any field that is nested 8 P˜ := SynthesizeSeqRegionProg(ex) inside another field whose instances have all been identified. Hence, 9 else /* f is a structure-ancestor of f */ if the leaf field instances are never null and the user does not need 10 ex := ∅ help with identifying representative examples, the user may simply 11 foreach R ∈ R˜ s.t. Subregion(R, R ˜ 1 ) = ⊥ do provide few examples for each leaf field and FlashExtract may be 12 ˜ ex := ex ∪ {(R, Subregion(R, R))} able to automatically infer the organization of the various leaf field P˜ := SynthesizeRegionProg(ex) instances. Otherwise, we recommend that the user iterates over fields in a top-down topological order. foreach P ∈ P˜ do CR := CR ∪ {(f.Color, R) | R ∈ P R , R ∈ R} ˜ 13 if CR is consistent with M then return (f , P ) 4. Inductive Synthesis Framework 14 return ⊥ In this section, we describe a general framework for developing the Algorithm 2: Synthesize a field extraction program. inductive synthesis APIs (namely, SynthesizeSeqRegionProg and SynthesizeRegionProg) that enable the example-based user interaction model discussed in the previous section. We build this materialized. If the user is not happy with the inferred highlighting, framework over the inductive synthesis methodology proposed by the user provides additional positive instances (of regions that Gulwani et.al.  of designing appropriate DSLs and developing FlashExtract failed to highlight) or negative instances (of unintended algorithms to synthesize programs in those DSLs from examples. regions that FlashExtract highlighted) and the synthesis process is However, we go one step further. We identify an algebra of core repeated. The examples for a simplified field consist of at most operators that can be used to build various data extraction DSLs for a single positive instance (possibly null) inside each region of the various document types (§4.2). We also present modular synthesis immediate ancestor field that has been materialized. If the user is not algorithms for each of these operators in terms of the synthesis happy with the inferred highlighting, the user provides additional algorithms for its (non-atomic) arguments (§4.3)—this enables examples and the synthesis process is repeated. automatic generation of synthesis algorithms for any DSL that is The example-based interaction is enabled by the procedure Syn- constructed using our algebra of core operators. We start out by thesizeFieldExtractionProg described in Algorithm 2, which formalizing the notion of a data extraction DSL. takes as input a document D, a schema M , a highlighting CR of the document that is consistent with M , a non-materialized field ˜ 1 , and a set of negative instances 4.1 Data Extraction DSLs f , a set of positive instances R R˜ 2 (which is empty in case field f has been simplified). The pro- A data extraction DSL is represented by a tuple (G, N1 , N2 ). G cedure SynthesizeFieldExtractionProg returns a program P is a grammar that defines data extraction strategies. It contains such that (a) P is consistent with the examples and (b) the updated definitions for various non-terminals N . Each non-terminal N is highlighting that results from executing P is consistent with the defined as a ranked collection of rules (also referred to as N.RHSs) schema M . Line 2 finds a suitable ancestor f from CR that forms of the same type. The type of a non-terminal is the type of its rules. the learning boundary for f . The loops at lines 6 and 11 group the in- A rule consists of a fixed expression or an operator applied to other put examples into boundaries imposed by f -regions. Depending on non-terminals of appropriate types or fixed expressions. The type of the relationship between f and f , FlashExtract invokes an appropri- a rule is the return type of the fixed expression or the operator that ate API provided by our inductive synthesis framework. In particular, constitutes the rule. it invokes SynthesizeSeqRegionProg (line 8) to learn a program We say a non-terminal is synthesizable if each of its rules for extracting a sequence of f -regions in an f -region (if f is a either (a) involves an operator from our core algebra applied to sequence-ancestor of f ), or SynthesizeRegionProg (line 12) to fixed expressions or synthesizable non-terminals, or (b) involves an learn a program for extracting a single f -region in an f -region (if operator that is equipped with an inductive synthesis algorithm of f is a structure-ancestor of f ). Both SynthesizeSeqRegionProg its own (i.e., domain-specific operators), or (c) fixed expressions. and SynthesizeRegionProg actually return a sequence of pro- N1 is a distinguished (top-level) synthesizable non-terminal of grams of the right type. The loop at line 12 selects the first program type sequence of regions. N2 is another distinguished (top-level) P in this sequence (if it exists) that ensures that the updated high- synthesizable non-terminal of type region. lighting that results from executing P is consistent with the schema An expression generated by a non-terminal of type T can be M . We describe this framework and its APIs in the next section. viewed as a program with return type T . Note that the expressions An interesting aspect of the above-mentioned interaction is the generated by N1 represent SeqRegion programs and expressions order in which the user iterates over various fields. FlashExtract is generated by N2 represent Region programs. The DSL expressions flexible enough to let users extract various fields in any iteration may involve one distinguished free variable R0 (of type Region) that order. This is especially useful when the user dynamically decides denotes the input to the top-level SeqRegion or Region programs. to update the data extraction schema (e.g., extract more fields). Any other free variable that occurs in a DSL expression must be Iterating over fields in a bottom-up ordering offers an interesting bound to some lambda definition that occurs in a higher level expression.

6. A state σ of a program P is an assignment to all free variables in Merge Operator A Merge operator takes as input a set of n P . We use the notation {x ← v} to create a state that maps variable sequence expressions, each of which is generated by the same x to value v. We use the notation σ[v/x] to denote setting the value non-terminal A (of some type of the form List T ). The return of variable x to value v in an existing state σ. We use the notation value of Merge also has the same type as that of A. The Merge P σ to denote the result of executing the program P in state σ. operator combines the results of these n expressions together—this Next, we discuss the core operators in our algebra that can be is useful when a single expression cannot extract all intended regions. used to build data extraction DSLs. This operator is a disjunctive abstraction and allows extraction of multiple-format field instances by merging several single-format 4.2 Core Algebra for Constructing Data Extraction DSLs field instances. Its semantics is as follows: Our core algebra is based around certain forms of map, filter, Merge(A1 , . . . , An ) σ = MergeSeq( A1 σ, . . . , An σ) merge, and pair operators. The pair operator (which returns a scalar) constitutes a scalar expression, while the other operators (which The MergeSeq operation merges its argument sequences with return a sequence) constitute a sequence expression. respect to the order of their elements’ locations in the original file. Decomposable Map Operator A Map operator has two arguments Pair Operator A Pair operator has two arguments A and B and λx : F and S, where S is a sequence expression of type List T has the following standard pair operator semantics. and F is some expression of type T and uses an additional free Pair(A, B) σ = ( A σ, B σ) variable x. The return type of Map is List T . Map(λx : F, S) has the standard semantics, wherein it applies function F to each The pair operator allows constructing region representations from element of the sequence produced by S to construct the resultant smaller elements. For example, we can create a text region from a sequence. pair of its start and end positions. Map(λx : F, S) σ = [t0 , . . . , tn ], where 4.3 Modular Synthesis Algorithms n = |Y − 1|, ti = F (σ[Y [i]/x]), Y = S σ The API SynthesizeSeqRegionProg takes as input a set of ex- We say that a Map operator is decomposable (w.r.t. the underlying amples, each of which consists of a triple (R, R˜1 , R˜2 ), where R DSL, which defines the language for F and S) if it has the following denotes the input region, R˜1 denotes positive instances of the re- property: For any input state σ and a sequence Y , there exists a gions that should be highlighted within R, and R˜2 denotes negative sequence Z such that instances of the regions that should not be highlighted within R. ∀F, S : Y Map(F, S) σ =⇒ Z S σ ∧ Map(F, Z) σ = Y The API SynthesizeRegionProg takes as input a set of examples, where denotes the subsequence relationship. Let Decompose be each of which consists of a pair (R, R ), where R denotes the input a function that computes such a witness Z, given σ and Y . It has region and R denotes the output region. Both these APIs return an the following signature: ordered set of programs in the DSL, each of which is consistent with the provided examples. Fig. 6 contains the pseudo-code for these Map.Decompose : (Region × List T ) → List T APIs, which we explain below. The Decompose function facilitates the reduction of examples for The method SynthesizeSeqRegionProg first learns from only Map operator to examples for its arguments F and S, thereby positive instances by invoking the learn method of the top-level reducing the task of learning the desired Map expression from sequence non-terminal N1 (line 2) and then selects those programs examples to the sub-tasks of learning F and S expressions from that additionally satisfy the negative instances constraint (loop at respective examples. line 4). The operator :: appends an element to a list. The method SynthesizeRegionProg simply invokes the learn method of the Filter Operators Our algebra allows two kinds of filter operators top-level region non-terminal N2 (line 10). over sequences, one that selects elements based on their properties The learn method for any non-terminal N simply involves (FilterBool), and the other one that selects elements based on invoking the learn method associated with its various rules (line 13) their indexes (FilterInt). and then returning the ordered union of the sequences of the A FilterBool operator has two arguments λx : B and programs synthesized from each. The operator + + performs list S, where S is a sequence expression of type List T and B concatenation. is a Boolean expression and uses an additional free variable x. We next briefly describe the key insights behind the learn The return type of FilterBool is List T . FilterBool(λx : methods for the operators that constitute the various rules. The F, S) has the standard filtering semantics: it selects those el- higher level key idea is to define them in terms of the learn methods ements from S that satisfy B. For example, if S is the set of their arguments. This allows for a free synthesis algorithm for of all lines in Ex. 1, then the expression FilterBool(λx : any data extraction DSL. EndsWith([Number, Quote], x), S) selects all yellow lines. The predicate EndsWith([Number, Quote], x) returns true iff the string Learning Decomposible Map Operator The key idea here is x ends with a number followed by a double quote. to first find the witness Zj for each example (σj , Yj ) using the A FilterInt operator has three arguments: a non-negative operator’s Decompose method (line 16). This allows us to split the integer init, a positive integer iter, and a sequence expression S. problem of learning a map expression into two independent simpler Its return value also has the same type as that of S. The FilterInt sub-problems, namely learning of a scalar expression F (line 18), operator takes every iter elements from S starting from init as and learning of a sequence expression S (line 20) from appropriate the first element. Its semantics is as follows: examples. The returned result is an appropriate cross-product style composition of the results returned from the sub-problems (line 25). FilterInt(init, iter, S) σ = let L = S σ in Filter λx : (indexof(L, x) − init)%iter = 0, L Learning Merge Operator The key idea here is to consider all (minimal) partitioning of the examples such that the learn method of For example, FilterInt(1, 2, S) selects all elements at odd indices the argument for each partition returns a non-empty result. The set T from a sequence. (at line 28) holds all such minimal partitions. For each such partition The two kinds of filter operators can be composed to enable (loop at line 30), we invoke the learn method of the argument for sophisticated filtering operations. each class in the partition (line 31), and then appropriately combine

7. function SynthesizeSeqRegionProg ( function FilterBool.Learn (Set (State, List T ) Q) : Set (Region, List Region , List Region ) Q) : List P rog is List P rog is 1 Q := {({R0 ← R}, R˜1 ) | (R, R˜1 , R˜2 ) ∈ Q} /* Let B and S be the predicate and sequence /* learn with positive examples */ arguments of FilterBool. */ 2 P˜ := N1 .Learn(Q ) /* start symbol for sequence */ 35 P˜1 := S.Learn(Q) /* learn sequence S */ 36 Q := {(σ[Y [i]/x], True) | (σ, Y ) ∈ Q, 0 ≤ i < len(Y )} 3 P˜ := [] 37 P˜2 := B.Learn(Q ) /* learn filter B */ 4 foreach P ∈ P˜ do 5 if (∃(R, R˜1 , R˜2 ) ∈ Q : ( P {R0 ← R}) ∩ R˜2 = ∅) then 38 P˜ := [] 6 continue /* P violate negative instances */ 39 foreach P1 ∈ P˜1 do 40 foreach P2 ∈ P˜2 do 7 P := P˜ :: P ˜ 41 P˜ := P˜ :: “FilterBool(P1 , P2 )” 8 return P˜ 42 return CleanUp(P˜ , Q) function SynthesizeRegionProg (Set (Region, Region) Q) : List P rog is function FilterInt.Learn (Set (State, List T ) Q) : 9 Q := {({R0 ← R}, R ) | (R, R ) ∈ Q} List P rog is 10 return N2 .Learn(Q ) /* start symbol for region */ /* Let S be the sequence argument of FilterInt. */ Let Q be {(σj , Yj )}1≤j≤m function N.Learn(Set (State, T ) Q) : List P rog is 43 P˜1 := S.Learn(Q) /* learn sequence S */ 11 P˜ := [] 44 P˜ := [] 12 foreach C ∈ N.RHSs do 45 foreach P1 ∈ P˜1 do 13 P˜ := P˜ ++ C.Learn(Q) 46 init := ∞ 47 iter := 0 14 return P˜ 48 for j := 1 . . . m do 49 Zj := P1 σj function Map.Learn (Set (State, List T ) Q) : List P rog is 50 init := Min(init, indexof(Zj , Yj )) /* Let F and S be the function and sequence 51 for i := 0 . . . |Yj | − 2 do arguments of Map. */ 52 t := indexof(Zj , Yj [i + 1]) − indexof(Zj , Yj [i]) Let Q be {(σj , Yj )}1≤j≤m 53 if iter = 0 then iter := t 15 for j := 1..m do /* find witnesses Z */ 54 else iter := GCD(iter, t) 16 Zj := Map.Decompose(σj , Yj ) 17 Q1 := {(σj [Zj [i]/x], Yj [i]) | 0 ≤ i < len(Zj ), 1 ≤ j ≤ m} 55 if iter = 0 then iter := 1 18 P˜1 := F.Learn(Q1 ) /* learn Map’s function F */ 56 P˜ := P˜ :: “FilterInt(init, iter, P1 )” 19 Q2 := {(σj , Zj ) | 1 ≤ j ≤ m} 57 return CleanUp(P˜ , Q) 20 P˜2 := S.Learn(Q2 ) /* learn Map’s sequence S */ 21 P˜ := [] function Pair.Learn (Set (State, (T1 , T2 )) Q) : List P rog is 22 foreach P1 ∈ P˜1 do /* Let A and B be the two arguments of Pair. */ 23 foreach P2 ∈ P˜2 do Let Q be {(σj , (uj , uj ))}1≤j≤m 24 P˜ := P˜ :: “Map(P1 , P2 )” 58 Q1 := {(σj , uj )}1≤j≤m ; Q2 := {(σj , uj )}1≤j≤m 59 P˜1 := A.Learn(Q1 ) 25 return CleanUp(P˜ , Q) 60 P˜2 := B.Learn(Q2 ) 61 if P˜1 = ∅ or P˜2 = ∅ then return [] function Merge.Learn (Set (State, List T ) Q) : List P rog is /* Let A be the non-terminal that forms the 62 P˜ := [] arguments of Merge. */ 63 foreach P1 ∈ P˜1 do Let Q be {(σj , Yj )}1≤j≤m 64 foreach P2 ∈ P˜2 do /* X holds all possible subsets of examples */ 65 P˜ := P˜ :: “Pair(P1 , P2 )” 26 X := {Q | Q = {(σj , Yj )}1≤j≤m , 66 return P˜ ∀1 ≤ j ≤ m : Yj Yj , A.Learn(Q ) = []} 27 Y := Yj /* all positive examples */ (σj ,Yj )∈Q function CleanUp(List P rog P˜ , Set (State, List T ) Q) : List P rog is /* T includes partitions that cover all examples */ 67 P˜ := [] 28 T := X | X is a minimal subset of X s.t. 68 foreach i = 1 to |P˜ | do {Yj | (σj , Yj ) ∈ Q , Q ∈ X } = Y 69 P := P [i] 29 P˜ := [] 70 incl := true 30 foreach X ∈ T ordered by size do 71 foreach k = 1 to |P˜ | do Let Q1 , ..., Qn be the various elements of X 72 if (P˜ [k] subsumes P w.r.t. Q) and ((P does not subsume 31 P˜1 , . . . , P˜n := A.Learn(Q1 ), . . . , A.Learn(Qn ) P˜ [k] w.r.t. Q) or k < i) then incl := false 32 P˜ := P˜ ++ {“Merge(P1 , ..., Pn )” | ∀1 ≤ i ≤ n : Pi ∈ P˜i } 73 if (incl = true) then P˜ := P˜ :: P 33 return CleanUp(P˜ , Q) 74 return P˜ Figure 6: Modular inductive synthesis algorithm in FlashExtract.

8.the results. Although this algorithm is exponential in the size of T HEOREM 2 (Completeness). If there exists some program that is the input set, it works efficiently in practice because the number of consistent with the input set of examples, SynthesizeSeqRegion- examples required for learning is very small in practice. Prog (and SynthesizeRegionProg) produce one such program. Learning Filter Operators The key idea in the learn method for The proof of Theorem 2 follows from two key observations: (a) FilterBool is to independently learn an ordered set P˜1 of sequence The learn methods associated with the scalar non-terminals and expressions (line 35) and an ordered set P˜2 of Boolean expressions the (scalar) Pair operator satisfy a similar completeness property. (line 37) that are consistent with the given examples. The returned (b) The learn methods associated with the sequence non-terminals result is an appropriate cross-product style composition of the sub- and the sequence operators of the core algebra satisfy a stronger results P˜1 and P˜2 . completeness theorem stated below (Theorem 3). The key idea in the learn method for FilterInt is to first learn an ordered set P˜ of sequence expressions (line 43) that are consistent T HEOREM 3 (Strong Completeness). The learn methods associ- with the given examples. Then, for each such program, we learn ated with the sequence non-terminals and the sequence operators the most strict filtering logic that filters as few elements as possible of the core algebra (namely Map, FilterBool, FilterInt, and Merge) while staying consistent with the examples. In particular, we select satisfy the following property: “For every sequence program P that init to be the minimum offset (across all examples) of the first is consistent with the input set of examples, there exists a program element in Yj in the sequence Zj returned by executing the sequence P in the learned set of programs that subsumes P .” program in the example state σj (line 50). We select iter to be the The proof of Theorem 3 follows from induction on the DSL’s GCD of all distances between the indices of any two contiguous structure. Note that the CleanUp optimization only removes those elements of Yj in Zj (line 54). (lower-ranked) programs that are subsumed by other programs. Learning Pair Operator The key idea here is to invoke the learn method of the first (second) argument at line 59 (line 60) to learn 5. Instantiations programs that can compute the first (second) element in the various output pairs in the examples from the respective inputs. The final We now present instantiations of our general framework to three result is produced by taking a cross-product of the two sets of different data extraction domains: text files, webpages, and spread- programs that are learned independently (loop at line 63). sheets. For each domain, we define the notion of a region, the do- main’s underlying data extraction DSL, and discuss the implementa- CleanUp Optimization An important performance and ranking tion of its domain-specific learn methods. optimization employed by the learn methods of various operators is use of the CleanUp method, which removes those programs that 5.1 Text Instantiation extract more regions than some retained program. More precisely, A region in this domain is a pair of character positions in the input this method removes each of those (lower-ranked) programs from text file. an ordered set of programs that is subsumed by some unremoved program (See Def. 6). Note that this does not affect the completeness Language Ltext Fig. 7 shows the syntax of Ltext , our data extrac- property associated with the various learning methods (Th. 3). Fur- tion DSL for this domain. The core algebra operators are in bold. thermore, it implements an important ranking criterion that assigns We name the various Map operators differently in order to associate higher likelihood to the scenario wherein the user provides consec- different Decompose methods with them. The non-terminal N1 is utive examples from the beginning of any immediately enclosing a Merge operator over constituent sequence expressions SS. The ancestral region (as opposed to providing arbitrary examples). non-terminal N2 is defined as a Pair operator over two position expressions. 4.4 Correctness The position expression Pos(x, p) evaluates to a position in We now describe the correctness properties associated with our two string x that is determined by the attribute p (inspired by a similar key synthesis APIs: SynthesizeSeqRegionProg and Synthesiz- concept introduced in ). The attribute p is either an absolute eRegionProg. First, we state some useful definitions. position k, or is the kth element of the position sequence identified by the regex pair rr which consists of two regexes r1 and r2 . The D EFINITION 5. (Consistency) A scalar program P (i.e., a program selection order is from left-to-right if k is positive, or right-to-left if that returns a scalar value) is said to be consistent with a set k is negative. The position sequence identified by (r1 , r2 ) in string Q = {(σj , uj )}j of scalar examples if ∀j : uj = P σj . A x, also referred to as PosSeq(x, rr), is the set of all positions k sequence program P (i.e., a program that returns a sequence) in x such that (some suffix of the substring on) the left side of k is said to be consistent with a set Q = {(σj , Yj )}j of sequence matches with r1 and (some prefix of the substring on) the right side examples with positive instances if ∀j : Yj ⊆ P σj . A sequence of k matches with r2 . A regex r is simply a concatenation of (at program P is said to be consistent with a set Q = {(σj , Yj , Yj )}j most 3) tokens. A token T is a pre-defined standard character class of sequence examples with positive and negative instances if ∀j : such as alphabets, digits, colon character, etc. (We used 30 such (Yj ⊆ P σj ∧ Yj ∩ P σj = ∅). tokens in our instantiation). We also define some context-sensitive D EFINITION 6. (Subsumption) Given a set Q = {(σj , Yj )}j of tokens dynamically based on frequently occurring string literals in sequence examples with positive instances, and two sequence pro- the neighborhood of examples highlighted by the user. For instance, grams P1 , P2 that are consistent with Q, we say that P1 subsumes in Ex. 1, our dynamically learned tokens include the string “DLZ P2 w.r.t. Q if ∀j : P1 σj ⊆ P2 σj . - Summary Report” (which is useful for learning the green outer structure boundary) and the string “"Sample ID:,""” (which is useful The following two theorems hold. to extract the orange sample ID). The first rule of SS consists of a Map operator LinesMap that T HEOREM 1 (Soundness). The programs P returned by Synthe- maps each line of a line sequence LS to a pair of positions within sizeSeqRegionProg and SynthesizeRegionProg are consis- that line. The Decompose method for LinesMap takes as input a tent with the input set of examples. region R and a sequence of position pairs and returns the sequence The proof of Theorem 1 follows easily by induction (on the structure of lines from R that contain the corresponding position pairs. of the DSL) from similar soundness property of the learn methods The second (third) rule of SS pairs each position x in a position associated with the non-terminals and core algebra operators. sequence P S with a position that occurs somewhere on its right (left)

9. Disjuctive Pos Pair Seq N1 ::= Merge(SS1 , . . . , SSn ) Disjuctive Seq N1 ::= Merge(N S1 , . . . , N Sn ) Pos Pair Region N2 ::= Pair(Pos(R0 , p1 ), Pos(R0 , p2 )) | Merge(SS1 , . . . , SSn ) Pair Seq SS ::= LinesMap(λx : Pair(Pos(x, p1 ), Pos(x, p2 )), LS) Region N2 ::= XPath | Pair(Pos(R0 , p1 ), Pos(R0 , p2 )) | StartSeqMap(λx : Pair(x, Pos(R0 [x : ], p)), P S) Node Seq N S ::= XPaths | EndSeqMap(λx : Pair(Pos(R0 [ : x], p), x), P S) Pos Pair Seq SS ::= Line Seq LS ::= FilterInt(init, iter, BLS) SeqPairMap(λx : Pair(Pos(x.Val, p1 ), Pos(x.Val, p2 )), ES) Bool Line Seq BLS ::= FilterBool(b, split(R0 , ‘\n’)) | StartSeqMap(λx : Pair(x, Pos(R0 [x : ], p)), P S) Position Seq P S ::= LinesMap(λx : Pos(x, p), LS) | EndSeqMap(λx : Pair(Pos(R0 [ : x], p), x), P S) | FilterInt(init, iter, PosSeq(R0 , rr)) Element Seq ES ::= FilterInt(init, iter, XPaths) Predicate b ::= λx : True Position Seq P S ::= FilterInt(init, iter, PosSeq(R0 , rr)) | λx : {Starts,Ends}With(r, x) | λx : Contains(r, k, x) Figure 8: The syntax of Lweb , the DSL for extracting webpages. | λx : Pred{Starts,Ends}With(r, x) | λx : PredContains(r, k, x) Definitions of p and rr are similar to those in Fig. 7. | λx : Succ{Starts,Ends}With(r, x) | λx : SuccContains(r, k, x) Position Attribute p ::= AbsPos(k) | RegPos(rr, k) them to dynamic tokens. The PosSeq operator returns the sequence of all end positions of the magenta sequence (since each of these Regex Pair rr ::= (r1 , r2 ) Regex r ::= T {0, 3} have an r1 match on the left and an r2 match on the right). Note that Token T ::= C+ | DynamicToken there are other positions that either have an r1 match on the left (such as the position before the number in "Sample ID:;""5007-01"""), or Figure 7: The syntax of Ltext , the DSL for extracting text files. have an r2 match on the right (such as the position after the character L in ""ug/L"",0.0009), but not both; hence, these positions are not side. The notation R0 [x : ] (R0 [ : x]) denotes the suffix (prefix) of selected by the PosSeq operator. Since FilterInt does not filter the text value represented by R0 starting (ending) at position x. The any elements, P S is the same sequence returned by the regex pair. Decompose method associated with StartSeqMap (EndSeqMap) The map function in EndSeqMap takes each end position in P S takes as input a region R and a sequence of positions and maps each and finds its corresponding start position specified by p, which is position k in the input sequence to the string R[k : ] (R[ : k]). the first position from the right (k = -1) that matches the dynamic The line sequence non-terminal LS uses a nested combination token (,"") on the left side. The result is the magenta sequence. of FilterInt and FilterBool. The various choices for predicate b (used in FilterBool) have the expected semantics. For exam- E XAMPLE 6. If the magenta field is wrapped within the yellow ple, StartsWith(r, x) asserts if line x starts with regex r, while structure, one of its extraction programs is as follows: Contains(r, k, x) asserts if line x contains k occurrences of regex Pair(Pos(R0 , p1 ), Pos(R0 , p2 )), where r. We also take hints from preceding and succeeding lines via Pred* p1 = [DynamicTok(,"")], , 1 , p2 = , [DynamicTok("",)], 1 and Succ* predicates. For example, PredStartsWith(r, x) asserts Since the yellow field is the structure-ancestor of the magenta if the line preceding x in the input text file ends with regex r. field, FlashExtract learns a Pair operator to extract a magenta The position sequence non-terminal P S includes expressions region within a yellow region. The start position of this pair is the that select a position within each line of a line sequence (using the first position from the left (k = 1) that matches (,"") on the left LinesMap operator) or that filter positions returned by the PosSeq side (r1 ), and the end position is the first position from the left that operator (using the FilterInt operator). matches ("",) on the right side (r2 ). This program is simpler than the one in Ex. 5, because it exploits the separation determined by E XAMPLE 4. Below is a program in Ltext for extracting the yellow the program for the yellow field. regions in Ex. 1 (from the top-level region of the entire file). LinesMap(λx : Pair(Pos(x, p1 ), Pos(x, p2 )), LS), where Domain-Specific Learn Methods The learning of Boolean expres- p1 = AbsPos(0), p2 = AbsPos(−1), sion b is performed using brute-force search. The learning of position LS = FilterInt(0, 1, attribute expressions p is performed using the technique described FilterBool(λx : EndsWith([Number, Quote], x), split(R0 , ‘\n’))) in prior work . The FilterBool operator takes all the lines in the document 5.2 Webpage Instantiation and selects only those that end with a number and a quote. The A region in this domain is either an HTML node, or a pair of FilterInt operator does not do any filtering (init = 0, iter = 1); character positions within the text property of an HTML node. it simply passes the result of FilterBool to LS. The map function in LinesMap returns the entire input line (AbsPos (0) denotes the Language Lweb Fig. 8 shows the syntax of the DSL Lweb for beginning of the line, while AbsPos (-1) denotes the end of the line). extracting data from webpages. XPath (XPaths) denote an XPath The LinesMap operator thus returns a sequence identical to LS, expression that returns a single HTML node (a sequence of HTML which is the yellow sequence. nodes) within the input HTML node. Position attribute p and regex pair rr are similar to those in the text instantiation DSL Ltext . Our E XAMPLE 5. Below is a program for extracting the magenta re- token set additionally includes dynamic tokens that we create for gions in Ex. 1 (from the top-level region of the entire file). the various HTML tags seen in the input webpage. EndSeqMap(λx : Pair(Pos(R0 [ : x], p), x), P S), where The non-terminal N1 represents expressions that compute a p = RegPos ([DynamicTok(,"")], ), −1 sequence of HTML nodes or a sequence of position pairs within P S = FilterInt(0, 1, PosSeq(R0 , (r1 , r2 )), and HTML nodes. The non-terminal N2 represents expressions that r1 = [DynamicTok(,""), Word], compute a HTML node or a position pair within a HTML node. The r2 = [DynamicTok("",), Number, Comma], design of Lweb is inspired by the design of Ltext . HTML nodes in Lweb play a similar role to that of lines in Ltext . We use XPath FlashExtract recognizes the prefixes (,"") and suffixes ("",) of expressions to identify relevant HTML elements instead of using the given examples as frequently occurring substrings and promotes regular expressions to identify appropriate lines.