Dynamic Coattention Networks for Question Answering

Several deep learning models have been proposed for question answering. However, due to their single-pass nature, they have no way to recover from local maxima corresponding to incorrect answers. To address this problem, we introduce the Dynamic Coattention Network (DCN) for question answering. The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers. On the Stanford question answering dataset, a single DCN model improves the previous state of the art from 71.0% F1 to 75.9%, while a DCN ensemble obtains 80.4% F1.

Published as a conference paper at ICLR 2017

Caiming Xiong∗, Victor Zhong∗, Richard Socher
Salesforce Research, Palo Alto, CA 94301, USA
{cxiong, vzhong, rsocher}@salesforce.com
arXiv:1611.01604v4 [cs.CL] 6 Mar 2018

1 INTRODUCTION

Question answering (QA) is a crucial task in natural language processing that requires both natural language understanding and world knowledge. Previous QA datasets tend to be high in quality due to human annotation, but small in size (Berant et al., 2014; Richardson et al., 2013). Hence, they did not allow for training data-intensive, expressive models such as deep neural networks.

To address this problem, researchers have developed large-scale datasets through semi-automated techniques (Hermann et al., 2015; Hill et al., 2016). Compared to their smaller, hand-annotated counterparts, these QA datasets allow the training of more expressive models. However, it has been shown that they differ from more natural, human-annotated datasets in the types of reasoning required to answer the questions (Chen et al., 2016).

Recently, Rajpurkar et al. (2016) released the Stanford Question Answering dataset (SQuAD), which is orders of magnitude larger than all previous hand-annotated datasets and has a variety of qualities that culminate in a natural QA task. SQuAD has the desirable quality that answers are spans in a reference document. This constrains answers to the space of all possible spans. However, Rajpurkar et al. (2016) show that the dataset retains a diverse set of answers and requires different forms of logical reasoning, including multi-sentence reasoning.

We introduce the Dynamic Coattention Network (DCN), illustrated in Fig. 1, an end-to-end neural network for question answering. The model consists of a coattentive encoder that captures the interactions between the question and the document, as well as a dynamic pointing decoder that alternates between estimating the start and end of the answer span. Our single model obtains an F1 of 75.9% compared to the best published result of 71.0% (Yu et al., 2016). In addition, our ensemble model obtains an F1 of 80.4% compared to the second best result of 78.1% on the official SQuAD leaderboard.¹

∗ Equal contribution
¹ As of Nov. 3 2016. See https://rajpurkar.github.io/SQuAD-explorer/ for latest results.

2 DYNAMIC COATTENTION NETWORKS

Figure 1 illustrates an overview of the DCN. We first describe the encoders for the document and the question, followed by the coattention mechanism and the dynamic decoder which produces the answer span.

[Figure 1: Overview of the Dynamic Coattention Network. The document encoder and the question encoder feed the coattention encoder, whose output is passed to the dynamic pointer decoder. Example document: "The weight of boilers and condensers generally makes the power-to-weight ... However, most electric power is generated using steam turbine plants, so that indirectly the world's industry is ...". Example question: "What plants create most electric power?". Output: start index 49, end index 51 ("steam turbine plants").]

2.1 DOCUMENT AND QUESTION ENCODER

Let $(x_1^Q, x_2^Q, \ldots, x_n^Q)$ denote the sequence of word vectors corresponding to words in the question and $(x_1^D, x_2^D, \ldots, x_m^D)$ denote the same for words in the document. Using an LSTM (Hochreiter & Schmidhuber, 1997), we encode the document as: $d_t = \mathrm{LSTM}_{\mathrm{enc}}(d_{t-1}, x_t^D)$. We define the document encoding matrix as $D = [d_1 \ldots d_m \; d_\varnothing] \in \mathbb{R}^{\ell \times (m+1)}$. We also add a sentinel vector $d_\varnothing$ (Merity et al., 2016), which we later show allows the model to not attend to any particular word in the input.

The question embeddings are computed with the same LSTM to share representation power: $q_t = \mathrm{LSTM}_{\mathrm{enc}}(q_{t-1}, x_t^Q)$. We define an intermediate question representation $Q' = [q_1 \ldots q_n \; q_\varnothing] \in \mathbb{R}^{\ell \times (n+1)}$. To allow for variation between the question encoding space and the document encoding space, we introduce a non-linear projection layer on top of the question encoding. The final representation for the question becomes: $Q = \tanh(W^{(Q)} Q' + b^{(Q)}) \in \mathbb{R}^{\ell \times (n+1)}$.

2.2 COATTENTION ENCODER

We propose a coattention mechanism that attends to the question and document simultaneously, similar to (Lu et al., 2016), and finally fuses both attention contexts. Figure 2 provides an illustration of the coattention encoder.

We first compute the affinity matrix, which contains affinity scores corresponding to all pairs of document words and question words: $L = D^\top Q \in \mathbb{R}^{(m+1) \times (n+1)}$. The affinity matrix is normalized row-wise to produce the attention weights $A^Q$ across the document for each word in the question, and column-wise to produce the attention weights $A^D$ across the question for each word in the document:

$A^Q = \mathrm{softmax}(L) \in \mathbb{R}^{(m+1) \times (n+1)}$ and $A^D = \mathrm{softmax}(L^\top) \in \mathbb{R}^{(n+1) \times (m+1)}$    (1)

Next, we compute the summaries, or attention contexts, of the document in light of each word of the question.

$C^Q = D A^Q \in \mathbb{R}^{\ell \times (n+1)}$    (2)
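The affinity matrix, the two attention-weight matrices of Eq. 1, and the document summaries of Eq. 2 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it uses plain Python lists instead of a tensor library, the function names are hypothetical, and it assumes that each softmax is taken so that the weights attached to one question word (for $A^Q$) or one document word (for $A^D$) sum to one.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def coattention_summaries(D, Q):
    """D is ell x (m+1), Q is ell x (n+1). Returns L, A_Q, A_D and C_Q."""
    L = matmul(transpose(D), Q)                               # (m+1) x (n+1)
    # A_Q: for each question word (a column of L), weights over document positions.
    A_Q = transpose([softmax(col) for col in transpose(L)])   # (m+1) x (n+1)
    # A_D: for each document word (a row of L), weights over question positions.
    A_D = transpose([softmax(row) for row in L])              # (n+1) x (m+1)
    C_Q = matmul(D, A_Q)                                      # ell x (n+1), Eq. 2
    return L, A_Q, A_D, C_Q

# Toy encodings with ell = 2, m + 1 = 3 document columns, n + 1 = 2 question columns.
D = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
Q = [[1.0, 0.0],
     [0.0, 1.0]]
L, A_Q, A_D, C_Q = coattention_summaries(D, Q)
print(len(C_Q), len(C_Q[0]))  # number of rows and columns of C_Q
```

Each column of `C_Q` is a convex combination of document encodings, which is one way to read "the summary of the document in light of a question word".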

[Figure 2: Coattention encoder. The affinity matrix $L$ is not shown here. We instead directly show the normalized attention weights $A^D$ and $A^Q$.]

We similarly compute the summaries $Q A^D$ of the question in light of each word of the document. Similar to Cui et al. (2016), we also compute the summaries $C^Q A^D$ of the previous attention contexts in light of each word of the document. These two operations can be done in parallel, as is shown in Eq. 3. One possible interpretation for the operation $C^Q A^D$ is the mapping of the question encoding into the space of document encodings.

$C^D = [Q; C^Q] A^D \in \mathbb{R}^{2\ell \times (m+1)}$    (3)

We define $C^D$, a co-dependent representation of the question and document, as the coattention context. We use the notation $[a; b]$ for concatenating the vectors $a$ and $b$ horizontally.

The last step is the fusion of temporal information to the coattention context via a bidirectional LSTM:

$u_t = \text{Bi-LSTM}(u_{t-1}, u_{t+1}, [d_t; c_t^D]) \in \mathbb{R}^{2\ell}$    (4)

We define $U = [u_1, \ldots, u_m] \in \mathbb{R}^{2\ell \times m}$, which provides a foundation for selecting which span may be the best possible answer, as the coattention encoding.

2.3 DYNAMIC POINTING DECODER

Due to the nature of SQuAD, an intuitive method for producing the answer span is by predicting the start and end points of the span (Wang & Jiang, 2016b). However, given a question-document pair, there may exist several intuitive answer spans within the document, each corresponding to a local maximum. We propose an iterative technique to select an answer span by alternating between predicting the start point and predicting the end point. This iterative procedure allows the model to recover from initial local maxima corresponding to incorrect answer spans.

Figure 3 provides an illustration of the Dynamic Decoder, which is similar to a state machine whose state is maintained by an LSTM-based sequential model. During each iteration, the decoder updates its state taking into account the coattention encoding corresponding to current estimates of the start and end positions, and produces, via a multilayer neural network, new estimates of the start and end positions.

Let $h_i$, $s_i$, and $e_i$ denote the hidden state of the LSTM, the estimate of the start position, and the estimate of the end position during iteration $i$. The LSTM state update is then described by Eq. 5.

$h_i = \mathrm{LSTM}_{\mathrm{dec}}(h_{i-1}, [u_{s_{i-1}}; u_{e_{i-1}}])$    (5)

where $u_{s_{i-1}}$ and $u_{e_{i-1}}$ are the representations corresponding to the previous estimate of the start and end positions in the coattention encoding $U$.
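The control flow of the decoder can be illustrated independently of the networks involved. The sketch below is an assumption-laden outline rather than the authors' code: `update_state`, `start_scores`, and `end_scores` are hypothetical callables standing in for the LSTM of Eq. 5 and the two scoring networks, the initial estimates `s0` and `e0` are arbitrary, and the stopping rule (no change in either estimate, or an iteration cap, set to 4 in Section 4.1) follows the description given with the training objective.

```python
def argmax(scores):
    """Index of the largest score (first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

def decode_span(update_state, start_scores, end_scores, h0, s0, e0, max_iters=4):
    """Iteratively re-estimate (start, end) until the estimates stop changing."""
    h, s, e = h0, s0, e0
    history = [(s, e)]
    for _ in range(max_iters):
        h = update_state(h, s, e)               # stands in for Eq. 5
        new_s = argmax(start_scores(h, s, e))   # scores over all document positions
        new_e = argmax(end_scores(h, s, e))
        converged = (new_s, new_e) == (s, e)
        s, e = new_s, new_e
        history.append((s, e))
        if converged:
            break
    return s, e, history

# Toy stand-ins over a 5-token document: the end scorer always prefers token 3,
# the start scorer prefers the token just before the current end estimate.
def toy_start(h, s, e):
    return [-abs(t - max(e - 1, 0)) for t in range(5)]

def toy_end(h, s, e):
    return [-abs(t - 3) for t in range(5)]

print(decode_span(lambda h, s, e: h + 1, toy_start, toy_end, h0=0, s0=0, e0=0))
```

In the toy run the first iteration fixes the end estimate and the second moves the start estimate next to it, which mirrors the kind of recovery from a poor initial guess that the decoder is designed for.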

[Figure 3: Dynamic Decoder. Blue denotes the variables and functions related to estimating the start position whereas red denotes the variables and functions related to estimating the end position. In the example shown, the argmax of the start scores gives $s_i = 49$ ("steam") and the argmax of the end scores gives $e_i = 51$.]

Given the current hidden state $h_i$, previous start position $u_{s_{i-1}}$, and previous end position $u_{e_{i-1}}$, we estimate the current start position and end position via Eq. 6 and Eq. 7.

$s_i = \operatorname{argmax}_t (\alpha_1, \ldots, \alpha_m)$    (6)
$e_i = \operatorname{argmax}_t (\beta_1, \ldots, \beta_m)$    (7)

where $\alpha_t$ and $\beta_t$ represent the start score and end score corresponding to the $t$th word in the document. We compute $\alpha_t$ and $\beta_t$ with separate neural networks. These networks have the same architecture but do not share parameters.

Based on the strong empirical performance of Maxout Networks (Goodfellow et al., 2013) and Highway Networks (Srivastava et al., 2015), especially with regard to deep architectures, we propose a Highway Maxout Network (HMN) to compute $\alpha_t$ as described by Eq. 8. The intuition behind using such a model is that the QA task consists of multiple question types and document topics. These variations may require different models to estimate the answer span. Maxout provides a simple and effective way to pool across multiple model variations.

$\alpha_t = \mathrm{HMN}_{\mathrm{start}}(u_t, h_i, u_{s_{i-1}}, u_{e_{i-1}})$    (8)

Here, $u_t$ is the coattention encoding corresponding to the $t$th word in the document. $\mathrm{HMN}_{\mathrm{start}}$ is illustrated in Figure 4. The end score, $\beta_t$, is computed similarly to the start score $\alpha_t$, but using a separate $\mathrm{HMN}_{\mathrm{end}}$.

We now describe the HMN model:

$\mathrm{HMN}(u_t, h_i, u_{s_{i-1}}, u_{e_{i-1}}) = \max\left(W^{(3)} [m_t^{(1)}; m_t^{(2)}] + b^{(3)}\right)$    (9)
$r = \tanh\left(W^{(D)} [h_i; u_{s_{i-1}}; u_{e_{i-1}}]\right)$    (10)
$m_t^{(1)} = \max\left(W^{(1)} [u_t; r] + b^{(1)}\right)$    (11)
$m_t^{(2)} = \max\left(W^{(2)} m_t^{(1)} + b^{(2)}\right)$    (12)
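Equations 9 to 12 can be traced with a small standard-library sketch. It is an illustration under stated assumptions, not the authors' implementation: the parameter shapes assume a hidden size $\ell$ and pool size $p$ with $W^{(D)}$ of shape $\ell \times 5\ell$, $W^{(1)}$ of shape $p \times \ell \times 3\ell$, $W^{(2)}$ of shape $p \times \ell \times \ell$, and $W^{(3)}$ of shape $p \times 1 \times 2\ell$; the names `hmn_score` and `random_params` are hypothetical; and the weights are random rather than trained.

```python
import math
import random

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def maxout(Ws, bs, x):
    """Maxout layer: p affine maps of x, then an element-wise max over the p results."""
    pieces = [[z + b for z, b in zip(matvec(W, x), bias)] for W, bias in zip(Ws, bs)]
    return [max(vals) for vals in zip(*pieces)]

def hmn_score(P, u_t, h_i, u_s, u_e):
    """Score one document position (alpha_t or beta_t, depending on the parameters P)."""
    r = [math.tanh(z) for z in matvec(P["W_D"], h_i + u_s + u_e)]  # Eq. 10
    m1 = maxout(P["W1"], P["b1"], u_t + r)                          # Eq. 11
    m2 = maxout(P["W2"], P["b2"], m1)                               # Eq. 12
    # Eq. 9: m1 is fed to the last layer directly, alongside m2 (highway connection).
    return maxout(P["W3"], P["b3"], m1 + m2)[0]

def random_params(ell, p, seed=0):
    """Hypothetical initializer: shapes as assumed above, values arbitrary."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_D": mat(ell, 5 * ell),
        "W1": [mat(ell, 3 * ell) for _ in range(p)], "b1": mat(p, ell),
        "W2": [mat(ell, ell) for _ in range(p)], "b2": mat(p, ell),
        "W3": [mat(1, 2 * ell) for _ in range(p)], "b3": mat(p, 1),
    }

# Toy usage: m = 5 tokens, each coattention encoding u_t has 2 * ell entries.
ell, p, m = 2, 3, 5
rng = random.Random(1)
U = [[rng.uniform(-1, 1) for _ in range(2 * ell)] for _ in range(m)]
h = [0.0] * ell
start_params = random_params(ell, p, seed=0)
alphas = [hmn_score(start_params, u_t, h, U[0], U[0]) for u_t in U]
s_i = max(range(m), key=alphas.__getitem__)  # Eq. 6
print(s_i)
```

A second parameter set created the same way would play the role of the separate end-scoring network, since the two networks share an architecture but not parameters.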

In Eqs. 9-12, $r \in \mathbb{R}^{\ell}$ is a non-linear projection of the current state with parameters $W^{(D)} \in \mathbb{R}^{\ell \times 5\ell}$, $m_t^{(1)}$ is the output of the first maxout layer with parameters $W^{(1)} \in \mathbb{R}^{p \times \ell \times 3\ell}$ and $b^{(1)} \in \mathbb{R}^{p \times \ell}$, and $m_t^{(2)}$ is the output of the second maxout layer with parameters $W^{(2)} \in \mathbb{R}^{p \times \ell \times \ell}$ and $b^{(2)} \in \mathbb{R}^{p \times \ell}$. $m_t^{(1)}$ and $m_t^{(2)}$ are fed into the final maxout layer, which has parameters $W^{(3)} \in \mathbb{R}^{p \times 1 \times 2\ell}$ and $b^{(3)} \in \mathbb{R}^{p}$. $p$ is the pooling size of each maxout layer. The max operation computes the maximum value over the first dimension of a tensor. We note that there is a highway connection between the output of the first maxout layer and the last maxout layer.

[Figure 4: Highway Maxout Network. Dotted lines denote highway connections.]

To train the network, we minimize the cumulative softmax cross entropy of the start and end points across all iterations. The iterative procedure halts when both the estimate of the start position and the estimate of the end position no longer change, or when a maximum number of iterations is reached. Details can be found in Section 4.1.

3 RELATED WORK

Statistical QA. Traditional approaches to question answering typically involve rule-based algorithms or linear classifiers over hand-engineered feature sets. Richardson et al. (2013) proposed two baselines, one that uses simple lexical features such as a sliding window to match bags of words, and another that uses word-distances between words in the question and in the document. Berant et al. (2014) proposed an alternative approach in which one first learns a structured representation of the entities and relations in the document in the form of a knowledge base, then converts the question to a structured query with which to match the content of the knowledge base. Wang et al. (2015) described a statistical model using frame semantic features as well as syntactic features such as part of speech tags and dependency parses. Chen et al. (2016) proposed a competitive statistical baseline using a variety of carefully crafted lexical, syntactic, and word order features.

Neural QA. Neural attention models have been widely applied for machine comprehension or question answering in NLP. Hermann et al. (2015) proposed an AttentiveReader model with the release of the CNN/Daily Mail cloze-style question answering dataset. Hill et al. (2016) released another dataset stemming from children's books and proposed a window-based memory network. Kadlec et al. (2016) presented a pointer-style attention mechanism that performs only one attention step. Sordoni et al. (2016) introduced an iterative neural attention model and applied it to cloze-style machine comprehension tasks.

Recently, Rajpurkar et al. (2016) released the SQuAD dataset. Different from cloze-style queries, answers include non-entities and longer phrases, and questions are more realistic. For SQuAD, Wang & Jiang (2016b) proposed an end-to-end neural network model that consists of a Match-LSTM encoder, originally introduced in Wang & Jiang (2016a), and a pointer network decoder (Vinyals et al., 2015); Yu et al. (2016) introduced a dynamic chunk reader, a neural reading comprehension model that extracts a set of answer candidates of variable lengths from the document and ranks them to answer the question.

Lu et al. (2016) proposed a hierarchical co-attention model for visual question answering, which achieved state-of-the-art results on the COCO-VQA dataset (Antol et al., 2015). In (Lu et al., 2016), the co-attention mechanism computes a conditional representation of the image given the question, as well as a conditional representation of the question given the image.
Inspired by the above works, we propose a dynamic coattention model (DCN) that consists of a novel coattentive encoder and dynamic decoder. In our model, instead of estimating the start and end positions of the answer span in a single pass (Wang & Jiang, 2016b), we iteratively update the start and end positions in a similar fashion to the Iterative Conditional Modes algorithm (Besag, 1986).

Model                                      Dev EM  Dev F1  Test EM  Test F1
Ensemble
DCN (Ours)                                 70.3    79.4    71.2     80.4
Microsoft Research Asia ∗                  -       -       69.4     78.3
Allen Institute ∗                          69.2    77.8    69.9     78.1
Singapore Management University ∗          67.6    76.8    67.9     77.0
Google NYC ∗                               68.2    76.7    -        -
Single model
DCN (Ours)                                 65.4    75.6    66.2     75.9
Microsoft Research Asia ∗                  65.9    75.2    65.5     75.0
Google NYC ∗                               66.4    74.9    -        -
Singapore Management University ∗          -       -       64.7     73.7
Carnegie Mellon University ∗               -       -       62.5     73.3
Dynamic Chunk Reader (Yu et al., 2016)     62.5    71.2    62.5     71.0
Match-LSTM (Wang & Jiang, 2016b)           59.1    70.0    59.5     70.3
Baseline (Rajpurkar et al., 2016)          40.0    51.0    40.4     51.0
Human (Rajpurkar et al., 2016)             81.4    91.0    82.3     91.2

Table 1: Leaderboard performance at the time of writing (Nov 4 2016). ∗ indicates that the model used for submission is unpublished. - indicates that the development scores were not publicly available at the time of writing.

4 EXPERIMENTS

4.1 IMPLEMENTATION DETAILS

We train and evaluate our model on the SQuAD dataset. To preprocess the corpus, we use the tokenizer from Stanford CoreNLP (Manning et al., 2014). We use GloVe word vectors pretrained on the 840B Common Crawl corpus (Pennington et al., 2014). We limit the vocabulary to words that are present in the Common Crawl corpus and set embeddings for out-of-vocabulary words to zero. Empirically, we found that training the embeddings consistently led to overfitting and subpar performance, and hence only report results with fixed word embeddings.

We use a max sequence length of 600 during training and a hidden state size of 200 for all recurrent units, maxout layers, and linear layers. All LSTMs have randomly initialized parameters and an initial state of zero. Sentinel vectors are randomly initialized and optimized during training.
For the dynamic decoder, we set the maximum number of iterations to 4 and use a maxout pool size of 16. We use dropout to regularize our network during training (Srivastava et al., 2014), and optimize the model using ADAM (Kingma & Ba, 2014). All models are implemented and trained with Chainer (Tokui et al., 2015).

4.2 RESULTS

Evaluation on the SQuAD dataset consists of two metrics. The exact match score (EM) calculates the exact string match between the predicted answer and a ground truth answer. The F1 score calculates the overlap between words in the predicted answer and a ground truth answer. Because a document-question pair may have several ground truth answers, the EM and F1 for a document-question pair are taken to be the maximum value across all ground truth answers. The overall metric is then computed by averaging over all document-question pairs. The official SQuAD evaluation is hosted on CodaLab². The training and development sets are publicly available while the test set is withheld.

² https://worksheets.codalab.org
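The two metrics and the max-over-references rule can be made concrete with a short sketch. It is a simplified stand-in for the official evaluation script and not a copy of it: the tokenization here is an assumed lowercase whitespace split, whereas the official script applies its own answer normalization, so scores may differ on some answers. The function names are hypothetical.

```python
from collections import Counter

def tokens(text):
    """Assumed tokenization: lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, truth):
    return float(tokens(prediction) == tokens(truth))

def f1(prediction, truth):
    """Harmonic mean of word-level precision and recall."""
    pred, gold = tokens(prediction), tokens(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def score_example(prediction, truths):
    """EM and F1 for one question: the best value over all ground truth answers."""
    return (max(exact_match(prediction, t) for t in truths),
            max(f1(prediction, t) for t in truths))

def score_dataset(predictions, truth_lists):
    """Average the per-question EM and F1 over all document-question pairs."""
    pairs = [score_example(p, ts) for p, ts in zip(predictions, truth_lists)]
    n = len(pairs)
    return sum(em for em, _ in pairs) / n, sum(f for _, f in pairs) / n

# Question 3 of Figure 5: two of three words overlap in each direction.
print(f1("particle beam weapons", "charged particle beam"))
```

For the Figure 5 example, the prediction "particle beam weapons" against the ground truth "charged particle beam" has precision 2/3 and recall 2/3, so its F1 is 2/3 while its EM is 0.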

The performance of the Dynamic Coattention Network on the SQuAD dataset, compared to other submitted models on the leaderboard³, is shown in Table 1. At the time of writing, our single-model DCN ranks first at 66.2% exact match and 75.9% F1 on the test data among single-model submissions. Our ensemble DCN ranks first overall at 71.6% exact match and 80.4% F1 on the test data.

The DCN has the capability to estimate the start and end points of the answer span multiple times, each time conditioned on its previous estimates. By doing so, the model is able to explore local maxima corresponding to multiple plausible answers, as is shown in Figure 5.

[Figure 5: Examples of the start and end conditional distributions produced by the dynamic decoder. Odd (blue) rows denote the start distributions and even (red) rows denote the end distributions. i indicates the iteration number of the dynamic decoder. Higher probability mass is indicated by darker regions. The offset corresponding to the word with the highest probability mass is shown on the right-hand side. The predicted span is underlined in red, and a ground truth answer span is underlined in green.

Question 1: Who recovered Tolbert's fumble?
Offsets by iteration: (s: 5, e: 22), (s: 6, e: 22), (s: 21, e: 22), ...
Answer: Danny Trevathan
Ground truth: Danny Trevathan

Question 2: What did the Kenyan business people hope for when meeting with the Chinese?
Offsets by iteration: (s: 66, e: 66), (s: 84, e: 94), ...
Answer: gain support from China for a planned $2.5 billion railway
Ground truth: support from China for a planned $2.5 billion railway

Question 3: What kind of weapons did Tesla's treatise concern?
Offsets by iteration: (s: 23, e: 25), (s: 24, e: 26), (s: 23, e: 25), (s: 24, e: 26), ...
Answer: particle beam weapons
Ground truth: charged particle beam]

For example, Question 1 in Figure 5 demonstrates an instance where the model initially guesses an incorrect start point and a correct end point. In subsequent iterations, the model adjusts the start point, ultimately arriving at the correct start point in iteration 3. Similarly, the model gradually shifts probability mass for the end point to the correct word.

Question 2 shows an example in which both the start and end estimates are initially incorrect. The model then settles on the correct answer in the next iteration.

³ https://rajpurkar.github.io/SQuAD-explorer

[Figure 6: Performance of the DCN for various lengths of documents, questions, and answers. Three panels plot F1 against # Tokens in Document, # Tokens in Question, and Average # Tokens in Answer. The blue dot indicates the mean F1 at a given length. The vertical bar represents the standard deviation of F1s at a given length.]

While the dynamic nature of the decoder allows the model to escape initial local maxima corresponding to incorrect answers, Question 3 demonstrates a case where the model is unable to decide between multiple local maxima despite several iterations. Namely, the model alternates between the answers "charged particle beam" and "particle beam weapons" indefinitely. Empirically, we observe that the model, trained with a maximum iteration of 4, takes 2.7 iterations to converge to an answer on average.

Model Ablation. The performance of our model and its ablations on the SQuAD development set is shown in Table 2. On the decoder side, we experiment with various pool sizes for the HMN maxout layers, using a 2-layer MLP instead of a HMN, and forcing the HMN decoder to a single iteration. Empirically, we achieve the best performance on the development set with an iterative HMN with pool size 16, and find that the model consistently benefits from a deeper, iterative decoder network. The performance improves as the number of maximum allowed iterations increases, with little improvement after 4 iterations.

Model                                      Dev EM  Dev F1
Dynamic Coattention Network (DCN)
  pool size 16 HMN                         65.4    75.6
  pool size 8 HMN                          64.4    74.9
  pool size 4 HMN                          65.2    75.2
DCN with 2-layer MLP instead of HMN        63.8    74.4
DCN with single iteration decoder          63.7    74.0
DCN with Wang & Jiang (2016b) attention    63.7    73.7

Table 2: Single model ablations on the development set.

On the encoder side, replacing the coattention mechanism with an attention mechanism similar to Wang & Jiang (2016b) by setting $C^D$ to $Q A^D$ in Eq. 3 results in a 1.9 point F1 drop (from 75.6 to 73.7). This suggests that, at an additional cost of a softmax computation and a dot product, the coattention mechanism provides a simple and effective means to better encode the document and question sequences. Further studies, such as performance without attention and performance on questions requiring different types of reasoning, can be found in the appendix.

Performance across length. One point of interest is how the performance of the DCN varies with respect to the length of the document. Intuitively, we expect the model performance to deteriorate with longer examples, as is the case with neural machine translation (Luong et al., 2015). However, as is shown in Figure 6, there is no notable performance degradation for longer documents and questions, contrary to our expectations. This suggests that the coattentive encoder is largely agnostic to long documents, and is able to focus on small sections of relevant text while ignoring the rest of the (potentially very long) document. We do note a performance degradation with longer answers. However, this is intuitive given the nature of the evaluation metric. Namely, it becomes increasingly challenging to compute the correct word span as the number of words increases.

[Figure 7: Performance of the DCN across question types. The height of each bar represents the mean F1 for the given question type. The lower number denotes how many instances in the dev set are of the corresponding question type: What 6073, Who 1242, How 1187, When 712, Which 642, Where 474, Why 150, Other 90.]

Performance across question type. Another natural way to analyze the performance of the model is to examine its performance across question types. In Figure 7, we note that the mean F1 of DCN exceeds those of previous systems (Wang & Jiang, 2016b; Yu et al., 2016) across all question types. The DCN, like other models, is adept at "when" questions and struggles with the more complex "why" questions.

Breakdown of F1 distribution. Finally, we note that the DCN performance is highly bimodal. On the development set, the model perfectly predicts (100% F1) an answer for 62.2% of examples and predicts a completely wrong answer (0% F1) for 16.3% of examples. That is, the model picks out partial answers only 21.5% of the time. Upon qualitative inspection of the 0% F1 answers, some of which are shown in Appendix A.4, we observe that when the model is wrong, its mistakes tend to have the correct "answer type" (e.g., person for a "who" question, method for a "how" question) and the answer boundaries encapsulate a well-defined phrase.

5 CONCLUSION

We proposed the Dynamic Coattention Network, an end-to-end neural network architecture for question answering. The DCN consists of a coattention encoder which learns co-dependent representations of the question and of the document, and a dynamic decoder which iteratively estimates the answer span. We showed that the iterative nature of the model allows it to recover from initial local maxima corresponding to incorrect predictions. On the SQuAD dataset, the DCN achieves state-of-the-art results at 75.9% F1 with a single model and 80.4% F1 with an ensemble. The DCN significantly outperforms all other models.

ACKNOWLEDGMENTS

We thank Kazuma Hashimoto and Bryan McCann for their help and insights.

REFERENCES

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.
Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D Manning. Modeling biological processes for reading comprehension. In EMNLP, 2014.
Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B (Methodological), pp. 259–302, 1986.
Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the cnn/daily mail reading comprehension task. In Association for Computational Linguistics (ACL), 2016.
Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423, 2016.
Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C Courville, and Yoshua Bengio. Maxout networks. ICML (3), 28:1319–1327, 2013.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701, 2015.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children's books with explicit memory representations. In ICLR, 2016.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. arXiv preprint arXiv:1606.00061, 2016.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Association for Computational Linguistics, September 2015.
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL (System Demonstrations), pp. 55–60, 2014.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pp. 1532–43, 2014.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), 2016.
Matthew Richardson, Christopher JC Burges, and Erin Renshaw. Mctest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, volume 3, pp. 4, 2013.
Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. Iterative alternating neural attention for machine reading. arXiv preprint arXiv:1606.02245, 2016.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Rupesh K Srivastava, Klaus Greff, and Juergen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems 28, pp. 2377–2385, 2015.
Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Machine comprehension with syntax, frames, and semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 700–706. Association for Computational Linguistics, 2015.
Shuohang Wang and Jing Jiang. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1442–1451. Association for Computational Linguistics, 2016a.
Shuohang Wang and Jing Jiang. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016b.
Y. Yu, W. Zhang, K. Hasan, M. Yu, B. Xiang, and B. Zhou. End-to-End Reading Comprehension with Dynamic Answer Chunk Ranking. ArXiv e-prints, October 2016.
Yang Yu, Wei Zhang, Kazi Hasan, Mo Yu, Bing Xiang, and Bowen Zhou. End-to-end answer chunk extraction and ranking for reading comprehension. arXiv preprint arXiv:1610.09996v2, 2016.

A APPENDIX

A.1 PERFORMANCE WITHOUT ATTENTION

In our experiments, we also investigate a model without any attention mechanism. In this model, the encoder is a simple LSTM network that first ingests the question and then ingests the document. The hidden states corresponding to words in the document are then passed to the decoder. This model achieves 33.3% exact match and 41.9% F1, significantly worse than models with attention.

A.2 SAMPLES REQUIRING DIFFERENT TYPES OF REASONING

We generate predictions for examples requiring different types of reasoning, given by Rajpurkar et al. (2016). Because this set of examples is very limited, it does not conclusively demonstrate the effectiveness of the model on different types of reasoning tasks. Nevertheless, these examples show that the DCN is a promising architecture for challenging question answering tasks, including those that involve reasoning over multiple sentences.

WHAT IS THE RANKINE CYCLE SOMETIMES CALLED?
The Rankine cycle is sometimes referred to as a practical Carnot cycle because, when an efficient turbine is used, the TS diagram begins to resemble the Carnot cycle.
Type of reasoning: Lexical variation (synonymy)
Ground truth: practical Carnot cycle
Prediction: practical Carnot cycle

WHICH TWO GOVERNING BODIES HAVE LEGISLATIVE VETO POWER?
While the Commission has a monopoly on initiating legislation, the European Parliament and the Council of the European Union have powers of amendment and veto during the legislative progress.
Type of reasoning: Lexical variation (world knowledge)
Ground truth: the European Parliament and the Council of the European Union
Prediction: European Parliament and the Council of the European Union

WHAT SHAKESPEARE SCHOLAR IS CURRENTLY ON THE UNIVERSITY'S FACULTY?
Current faculty include the anthropologist Marshall Sahlins, historian Dipesh Chakrabarty, ... Shakespeare scholar David Bevington, and renowned political scientists John Mearsheimer and Robert Pape.
Type of reasoning: Syntactic variation
Ground truth: David Bevington
Prediction: David Bevington

WHAT COLLECTION DOES THE V&A THEATRE & PERFORMANCE GALLERIES HOLD?
The V&A Theatre & Performance galleries, formerly the Theatre Museum, opened in March 2009. The collections are stored by the V&A, and are available for research, exhibitions and other shows. They hold the UK’s biggest national collection of material about live performance in the UK since Shakespeare’s day, covering drama, dance, musical theatre, circus, music hall, rock and pop, and most other forms of live entertainment.
Type of reasoning: Multiple sentence reasoning
Ground truth: Material about live performance

Prediction: UK’s biggest national collection of material about live performance in the UK since Shakespeare’s day

WHAT IS THE MAIN GOAL OF CRIMINAL PUNISHMENT OF CIVIL DISOBEDIENTS?
Along with giving the offender his “just deserts”, achieving crime control via incapacitation and deterrence is a major goal of crime punishment.
Type of reasoning: Ambiguous
Ground truth: achieving crime control via incapacitation and deterrence
Prediction: achieving crime control via incapacitation and deterrence

A.3 SAMPLES OF CORRECT SQuAD PREDICTIONS BY THE DYNAMIC COATTENTION NETWORK

HOW DID THE MONGOLS ACQUIRE CHINESE PRINTING TECHNOLOGY?
ID: 572882242ca10214002da420
The Mongol rulers patronized the Yuan printing industry. Chinese printing technology was transferred to the Mongols through Kingdom of Qocho and Tibetan intermediaries. Some Yuan documents such as Wang Zhen’s Nong Shu were printed with earthenware movable type, a technology invented in the 12th century. However, most published works were still produced through traditional block printing techniques. The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei’s wife, is one of the first printed works sponsored by the Mongols. In 1273, the Mongols created the Imperial Library Directorate, a government-sponsored printing office. The Yuan government established centers for printing throughout China. Local schools and government agencies were funded to support the publishing of books.
Ground truth: through Kingdom of Qocho and Tibetan intermediaries
Prediction: through Kingdom of Qocho and Tibetan intermediaries

WHO APPOINTS ELDERS?
ID: 5730d473b7151e1900c0155b
Elders are called by God, affirmed by the church, and ordained by a bishop to a ministry of Word, Sacrament, Order and Service within the church. They may be appointed to the local church, or to other valid extension ministries of the church. Elders are given the authority to preach the Word of God, administer the sacraments of the church, to provide care and counseling, and to order the life of the church for ministry and mission. Elders may also be assigned as District Superintendents, and they are eligible for election to the episcopacy. Elders serve a term of 2–3 years as provisional Elders prior to their ordination.
Ground truth: bishop, the local church
Prediction: a bishop

AN ALGORITHM FOR X WHICH REDUCES TO C WOULD ALLOW US TO DO WHAT?
ID: 56e1ce08e3433e14004231a6
This motivates the concept of a problem being hard for a complexity class. A problem X is hard for a class of problems C if every problem in C can be reduced to X. Thus no problem in C is harder than X, since an algorithm for X allows us to solve any problem in C. Of course, the notion of hard problems depends on the type of reduction being used. For complexity classes larger than P, polynomial-time reductions are commonly used. In particular, the set of problems that are hard for NP is the set of NP-hard problems.
Ground truth: solve any problem in C

Prediction: solve any problem in C

HOW MANY GENERAL QUESTIONS ARE AVAILABLE TO OPPOSITION LEADERS?
ID: 572fd7b8947a6a140053cd3e
Parliamentary time is also set aside for question periods in the debating chamber. A “General Question Time” takes place on a Thursday between 11:40 a.m. and 12 p.m. where members can direct questions to any member of the Scottish Government. At 2.30pm, a 40-minute long themed “Question Time” takes place, where members can ask questions of ministers in departments that are selected for questioning that sitting day, such as health and justice or education and transport. Between 12 p.m. and 12:30 p.m. on Thursdays, when Parliament is sitting, First Minister’s Question Time takes place. This gives members an opportunity to question the First Minister directly on issues under their jurisdiction. Opposition leaders ask a general question of the First Minister and then supplementary questions. Such a practice enables a “lead-in” to the questioner, who then uses their supplementary question to ask the First Minister any issue. The four general questions available to opposition leaders are:
Ground truth: four
Prediction: four

WHAT ARE SOME OF THE ACCEPTED GENERAL PRINCIPLES OF EUROPEAN UNION LAW?
ID: 5726a00cf1498d1400e8e551
The principles of European Union law are rules of law which have been developed by the European Court of Justice that constitute unwritten rules which are not expressly provided for in the treaties but which affect how European Union law is interpreted and applies. In formulating these principles, the courts have drawn on a variety of sources, including: public international law and legal doctrines and principles present in the legal systems of European Union member states and in the jurisprudence of the European Court of Human Rights. Accepted general principles of European Union Law include fundamental rights (see human rights), proportionality, legal certainty, equality before the law and subsidiarity.
Ground truth: fundamental rights (see human rights), proportionality, legal certainty, equality before the law and subsidiarity
Prediction: fundamental rights (see human rights), proportionality, legal certainty, equality before the law and subsidiarity

WHY WAS TESLA RETURNED TO GOSPIC?
ID: 56dfaa047aa994140058dfbd
On 24 March 1879, Tesla was returned to Gospić under police guard for not having a residence permit. On 17 April 1879, Milutin Tesla died at the age of 60 after contracting an unspecified illness (although some sources say that he died of a stroke). During that year, Tesla taught a large class of students in his old school, Higher Real Gymnasium, in Gospić.
Ground truth: not having a residence permit
Prediction: not having a residence permit

A.4 SAMPLES OF INCORRECT SQuAD PREDICTIONS BY THE DYNAMIC COATTENTION NETWORK

WHAT IS ONE SUPPLEMENTARY SOURCE OF EUROPEAN UNION LAW?
ID: 5725c3a9ec44d21400f3d506
European Union law is applied by the courts of member states and the Court of Justice of the European Union. Where the laws of member states provide for lesser rights European Union law can be

enforced by the courts of member states. In case of European Union law which should have been transposed into the laws of member states, such as Directives, the European Commission can take proceedings against the member state under the Treaty on the Functioning of the European Union. The European Court of Justice is the highest court able to interpret European Union law. Supplementary sources of European Union law include case law by the Court of Justice, international law and general principles of European Union law.
Ground truth: international law
Prediction: case law by the Court of Justice
Comment: The prediction produced by the model is correct; however, it was not selected by Mechanical Turk annotators.

WHO DESIGNED THE ILLUMINATION SYSTEMS THAT TESLA ELECTRIC LIGHT & MANUFACTURING INSTALLED?
ID: 56e0d6cf231d4119001ac424
After leaving Edison’s company Tesla partnered with two businessmen in 1886, Robert Lane and Benjamin Vail, who agreed to finance an electric lighting company in Tesla’s name, Tesla Electric Light & Manufacturing. The company installed electrical arc light based illumination systems designed by Tesla and also had designs for dynamo electric machine commutators, the first patents issued to Tesla in the US.
Ground truth: Tesla
Prediction: Robert Lane and Benjamin Vail
Comment: The model produces an incorrect prediction that corresponds to the people who funded Tesla, instead of Tesla, who actually designed the illumination system. Empirically, we find that most mistakes made by the model have the correct type (e.g., named entity type), even though types are not provided to the model as prior knowledge. In this case, the incorrect response has the correct type of person.

CYDIPPID ARE TYPICALLY WHAT SHAPE?
ID: 57265746dd62a815002e821a
Cydippid ctenophores have bodies that are more or less rounded, sometimes nearly spherical and other times more cylindrical or egg-shaped; the common coastal “sea gooseberry,” Pleurobrachia, sometimes has an egg-shaped body with the mouth at the narrow end, although some individuals are more uniformly round. From opposite sides of the body extends a pair of long, slender tentacles, each housed in a sheath into which it can be withdrawn. Some species of cydippids have bodies that are flattened to various extents, so that they are wider in the plane of the tentacles.
Ground truth: more or less rounded, egg-shaped
Prediction: spherical
Comment: Although the mistake is subtle, the prediction is incorrect. The statement “are more or less rounded, sometimes nearly spherical” suggests that the entity is more often “rounded” than “spherical” or “cylindrical” or “egg-shaped” (an answer given by an annotator). This suggests that the model has trouble discerning among multiple intuitive answers due to a lack of understanding of the relative severity of “more or less” versus “sometimes” and “other times”.
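The ground-truth and prediction pairs in this appendix are compared using exact match and F1, the two metrics reported in A.1. As a rough illustration of how such a comparison can work, the Python sketch below follows the answer normalization commonly used in SQuAD-style evaluation (lowercasing, then removing punctuation and English articles) and computes token-level F1, taking the best score over all reference answers. The function names are hypothetical, and the sketch assumes that a comma-separated ground truth in the samples lists answers from different annotators; it is not the authors' evaluation code.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True when the two answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, ground_truths):
    """Score a prediction against every reference answer and keep the best."""
    return max(metric(prediction, gt) for gt in ground_truths)


if __name__ == "__main__":
    # Article removal makes "a bishop" an exact match for "bishop".
    print(best_over_references(exact_match, "a bishop",
                               ["bishop", "the local church"]))
    # No token overlap with either reference, so F1 is 0.0.
    print(best_over_references(token_f1, "spherical",
                               ["more or less rounded", "egg-shaped"]))
```

Under this scoring, a prediction that differs from the ground truth only by a leading article, such as "European Parliament and the Council of the European Union", still counts as an exact match, while a longer span that contains the reference answer receives partial F1 credit but no exact match.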