Read + Verify: Machine Reading Comprehension with Unanswerable Questions

Machine reading comprehension with unanswerable questions aims to abstain from answering when no answer can be inferred. In addition to extracting answers, previous works usually predict an additional "no-answer" probability to detect unanswerable cases. However, they fail to validate the answerability of the question by verifying the legitimacy of the predicted answer. To address this problem, we propose a novel read-then-verify system, which not only utilizes a neural reader to extract candidate answers and produce no-answer probabilities, but also leverages an answer verifier to decide whether the predicted answer is entailed by the input snippets. Moreover, we introduce two auxiliary losses to help the reader better handle answer extraction as well as no-answer detection, and investigate three different architectures for the answer verifier. Our experiments on the SQuAD 2.0 dataset show that our system obtains a score of 74.2 F1 on the test set, achieving state-of-the-art results at the time of submission (Aug. 28th, 2018).

Minghao Hu¹*, Furu Wei², Yuxing Peng¹, Zhen Huang¹, Nan Yang², Dongsheng Li¹
¹ College of Computer, National University of Defense Technology
² Microsoft Research Asia
{huminghao09,pengyuxing,huangzhen,dsli} {fuwei,nanya}
arXiv:1808.05759v5 [cs.CL] 15 Nov 2018
* Contribution during internship at Microsoft Research Asia.
Copyright © 2019, Association for the Advancement of Artificial Intelligence. All rights reserved.

[Figure 1 example. Question: "What is France a region of?" Passage: "The Normans were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were ..." Read step: Answer: Normandy, NA Prob: 0.4. Verify step on the sentence "The Normans … in France.": NA Prob: 0.9. Final Prob: 0.65, so the output is "No answer".]

Figure 1: An overview of our approach. The reader first extracts a candidate answer and produces a no-answer probability (NA Prob). The answer verifier then checks whether the extracted answer is legitimate or not. Finally, the system aggregates previous results and outputs the final prediction.

Introduction

The ability to comprehend text and answer questions is crucial for natural language processing. Due to the creation of various large-scale datasets (Hermann et al. 2015; Nguyen et al. 2016; Joshi et al. 2017; Kočiský et al. 2018), remarkable advancements have been made in the task of machine reading comprehension. Nevertheless, one important assumption behind current approaches is that a correct answer always exists in the context passage. Therefore, the models only need to choose the most plausible text span based on the question, instead of checking whether an answer exists in the first place. Recently, a new version of the Stanford Question Answering Dataset (SQuAD), namely SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), has been proposed to test the ability to answer answerable questions as well as to detect unanswerable cases. To deal with unanswerable cases, systems must learn to identify a wide range of linguistic phenomena such as negation, antonymy and entity changes between the passage and the question.

Previous works (Levy et al. 2017; Clark and Gardner 2018; Kundu and Ng 2018) all apply a shared-normalization operation between a "no-answer" score and answer span scores, so as to produce a probability that a question is unanswerable as well as output a candidate answer. However, they have not considered further validating the answerability of the question by verifying the legitimacy of the predicted answer. Here, answerability denotes whether the question has an answer, and legitimacy means whether the extracted text can be supported by the passage and the question. Humans, by contrast, tend to first find a plausible answer given a question, and then check whether there are any contradictory semantics.

To address the above issue, we propose in this paper a read-then-verify system that aims to be robust to unanswerable questions. As shown in Figure 1, our system consists of two components: (1) a no-answer reader for extracting candidate answers and detecting unanswerable questions, and (2) an answer verifier for deciding whether or not the extracted candidate is legitimate. The key contributions of our work are three-fold.

First, we augment existing readers with two auxiliary losses, to better handle answer extraction and no-answer detection respectively. Since the downstream verifying stage always requires a candidate answer, the reader must be able to extract plausible answers for all questions. However, previous approaches are not trained to find potential candidates for unanswerable questions. We solve this problem by introducing an independent span loss that concentrates on the answer extraction task regardless of the answerability of the question. In order not to conflict with no-answer detection, we leverage a multi-head pointer network to generate two pairs of span scores, where one pair is normalized with the no-answer score and the other is used for our auxiliary loss. In addition, we present an independent no-answer loss to further alleviate the conflict, by focusing on the no-answer detection task without considering the shared normalization of answer extraction.

Second, in addition to the standard reading phase, we introduce an additional answer verifying phase, which aims at finding local entailment that supports the answer by comparing the answer sentence with the question. This is based on the observation that the core phenomenon of unanswerable questions usually occurs between a few passage words and question words. Taking Figure 1 as an example, after comparing the passage snippet "Normandy, a region in France" with the question, we can easily determine that no answer exists since the question asks for an impossible condition (footnote 1). This observation is even more obvious when antonymy or mutual exclusion occurs, for example when the question asks for "the decline of rainforests" but the passage mentions that "the rainforests spread out". Inspired by recent advances in natural language inference (NLI) (Bowman et al. 2015), we investigate three different architectures for the answer verifying task. The first one is a sequential model that takes two sentences as a long sequence, while the second one attempts to capture interactions between two sentences. The last one is a hybrid model that combines the above two models to test if the performance can be further improved.

Lastly, we evaluate our system on the SQuAD 2.0 dataset (Rajpurkar, Jia, and Liang 2018), a reading comprehension benchmark augmented with unanswerable questions. Our best reader achieves an F1 score of 73.7 with ELMo embeddings (Peters et al. 2018) and 69.1 without them on the development set. When combined with the answer verifier, the whole system improves to 74.8 F1 and 71.5 F1 respectively. Moreover, the best system obtains a score of 74.2 F1 on the test set, achieving state-of-the-art results at the time of submission (Aug. 28th, 2018).

Footnote 1: Impossible condition means that the question asks for something that is not satisfied by anything in the given passage.

Background

Existing reading comprehension models focus on answering questions where a correct answer is guaranteed to exist. They are not able to identify unanswerable questions, and instead tend to return an unreliable text span. Consequently, we first give a brief introduction to the unanswerable reading comprehension task, and then review current solutions.

Task Description

Given a context passage and a question, the machine needs to not only find answers to answerable questions but also detect unanswerable cases. The passage and the question are described as sequences of word tokens, denoted as $P = \{x_i^p\}_{i=1}^{l_p}$ and $Q = \{x_j^q\}_{j=1}^{l_q}$ respectively, where $l_p$ is the passage length and $l_q$ is the question length. Our goal is to predict an answer $A$, which is constrained to be a segment of text in the passage, $A = \{x_i^p\}_{i=l_a}^{l_b}$, or to return an empty string if there is no answer, where $l_a$ and $l_b$ indicate the answer boundary.

No-Answer Reader

To predict an answer span, current approaches first embed and encode both the passage and the question into two series of fixed-size vectors. Then they leverage various attention mechanisms, such as bi-attention (Seo et al. 2017) or reattention (Hu et al. 2018a), to build interdependent representations for passage and question, which are denoted as $U = \{u_i\}_{i=1}^{l_p}$ and $V = \{v_j\}_{j=1}^{l_q}$ respectively. Finally, they summarize the question representation into a dense vector $t$, and utilize the pointer network (Vinyals, Fortunato, and Jaitly 2015) to produce two scores over passage words that indicate the answer boundary (Wang et al. 2017):

$$o_j = w_v^{\top} v_j, \qquad t = \sum_{j=1}^{l_q} \frac{e^{o_j}}{\sum_{k=1}^{l_q} e^{o_k}} v_j$$

$$\alpha, \beta = \text{pointer\_network}(U, t)$$

where $\alpha$ and $\beta$ are the span scores for answer start and end bounds.

In order to additionally detect if the question is unanswerable, previous approaches (Levy et al. 2017; Clark and Gardner 2018; Kundu and Ng 2018) attempt to predict a special no-answer score $z$ in addition to the distribution over answer spans. Concretely, a shared softmax function can be applied to normalize both the no-answer score and the span scores, yielding a joint no-answer objective defined as:

$$\mathcal{L}_{joint} = -\log \frac{(1 - \delta) e^{z} + \delta e^{\alpha_a \beta_b}}{e^{z} + \sum_{i=1}^{l_p} \sum_{j=1}^{l_p} e^{\alpha_i \beta_j}}$$

where $a$ and $b$ are the ground-truth start and end positions, and $\delta$ is 1 if the question is answerable and 0 otherwise. At test time, a question is detected as being unanswerable once the normalized no-answer score exceeds some threshold.

Approach

In this section we describe our proposed read-then-verify system. The system first leverages a neural reader to extract a candidate answer and detect if the question is unanswerable. It then utilizes an answer verifier to further check the legitimacy of the predicted answer. We enhance the reader with two novel auxiliary losses, and investigate three different architectures for the answer verifier.
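As a rough illustration of the joint no-answer objective defined in the Background section, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes the span score of $(i, j)$ is the product $\alpha_i \beta_j$ exactly as the formula is printed. The function names, the `gold_span` argument and the default threshold of 0.5 are hypothetical choices made for this example only, since the paper leaves the threshold unspecified in this section.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def joint_no_answer_loss(alpha, beta, z, gold_span=None):
    """Joint objective with a softmax shared by z and all span scores.

    alpha, beta: start and end scores over passage words (lists of floats).
    z: no-answer score.
    gold_span: (a, b) for an answerable question (delta = 1),
               or None for an unanswerable question (delta = 0).
    """
    # Assumption: span (i, j) is scored as alpha_i * beta_j, as printed.
    span_scores = [[a_i * b_j for b_j in beta] for a_i in alpha]
    flat = [s for row in span_scores for s in row]
    log_norm = log_sum_exp([z] + flat)
    if gold_span is None:
        target = z                      # delta = 0: the mass should go to z
    else:
        a, b = gold_span
        target = span_scores[a][b]      # delta = 1: mass on the gold span
    return log_norm - target


def no_answer_probability(alpha, beta, z):
    """Normalized no-answer score under the shared softmax."""
    flat = [a_i * b_j for a_i in alpha for b_j in beta]
    return math.exp(z - log_sum_exp([z] + flat))


def is_unanswerable(alpha, beta, z, threshold=0.5):
    # The threshold value here is a placeholder, not a value from the paper.
    return no_answer_probability(alpha, beta, z) > threshold
```

Because the no-answer score and the span scores share one normalizer, raising any span score necessarily lowers the no-answer probability, which is the coupling that the auxiliary losses are meant to relax.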

[Figure 2 panels: (a) input structures: Model-I takes the answer sentence, the question and the answer as one sequence, Model-II takes the answer sentence and the question separately, and the outputs are merged; (b) a 12-layer stack of Text & Position Embed, Masked Multi Self Attention, Add & Norm, Feed Forward, Add & Norm; (c) Text Embed, BiLSTM, Inference Modeling, Intra-Sent Modeling, BiLSTM, Mean-max Pooling.]

Figure 2: An overview of answer verifiers. (a) Input structures for running three different models. (b) Generative Pre-trained Transformer proposed by Radford et al. (2018). Here, "Masked Multi Self Attention" refers to multi-head self-attention function (Vaswani et al. 2017) that only attends to previous tokens. "Add & Norm" indicates residual connection and layer normalization. (c) Our proposed token-wise interaction model, which is designed to compare two sentences and aggregate the results for verifying the answer.

Reader with Auxiliary Losses

Although previous no-answer readers are capable of jointly learning answer extraction and no-answer detection, each individual task has its own problem. For answer extraction, previous readers are not trained to find candidate answers for unanswerable questions. In our system, however, the reader is required to extract a plausible answer that is fed to the downstream verifying stage for all questions. As for no-answer detection, a conflict could be triggered due to the shared normalization between span scores and no-answer score. Since the sum of these normalized scores is always 1, an over-confident span probability would cause an unconfident no-answer probability, and vice versa. Therefore, inaccurate confidence on answer span, which has been observed by Clark and Gardner (2018), could lead to imprecise prediction on no-answer score. To address the above issues, we propose two auxiliary losses to optimize and enhance each task independently without interfering with each other.

Independent Span Loss. This loss is designed to concentrate on answer extraction. In this task, the model is asked to extract candidate answers for all possible questions. Therefore, besides answerable questions, we also include unanswerable cases as positive examples, and consider the plausible answer as the gold answer (footnote 2). In order not to conflict with no-answer detection, we propose to use a multi-head pointer network to additionally produce another pair of span scores $\tilde{\alpha}$ and $\tilde{\beta}$:

$$\tilde{o}_j = \tilde{w}_v^{\top} v_j, \qquad \tilde{t} = \sum_{j=1}^{l_q} \frac{e^{\tilde{o}_j}}{\sum_{k=1}^{l_q} e^{\tilde{o}_k}} v_j$$

$$\tilde{\alpha}, \tilde{\beta} = \text{pointer\_network}(U, \tilde{t})$$

where multiple heads share the same network architecture but with different parameters.

Then, we define an independent span loss as:

$$\mathcal{L}_{indep\text{-}I} = -\log \frac{e^{\tilde{\alpha}_{\tilde{a}} \tilde{\beta}_{\tilde{b}}}}{\sum_{i=1}^{l_p} \sum_{j=1}^{l_p} e^{\tilde{\alpha}_i \tilde{\beta}_j}}$$

where $\tilde{a}$ and $\tilde{b}$ are the augmented ground-truth answer boundaries. The final span probability is obtained using a simple mean pooling over the two pairs of softmax-normalized span scores.

Independent No-Answer Loss. Despite a multi-head pointer network being used to prevent the conflict problem, no-answer detection can still be weakened since the no-answer score $z$ is normalized with span scores. Therefore, we consider exclusively encouraging the prediction on no-answer detection. This is achieved by introducing an independent no-answer loss as:

$$\mathcal{L}_{indep\text{-}II} = -(1 - \delta) \log \sigma(z) - \delta \log (1 - \sigma(z))$$

where $\sigma$ is the sigmoid activation function. Through this loss, we expect the model to produce a more confident prediction on no-answer score $z$ without considering the shared-normalization operation.

Finally, we combine the above losses as follows:

$$\mathcal{L} = \mathcal{L}_{joint} + \gamma \mathcal{L}_{indep\text{-}I} + \lambda \mathcal{L}_{indep\text{-}II}$$

where $\gamma$ and $\lambda$ are two hyper-parameters that control the weight of the two auxiliary losses.

Footnote 2: In SQuAD 2.0, the plausible answer is annotated by humans for every unanswerable question. A pre-trained reader can also be used to extract plausible answers if no annotation is provided.

Answer Verifier

After the answer is extracted, an answer verifier is used to compare the answer sentence with the question, so as to recognize local textual entailment that supports the answer. Here, we define the answer sentence as the context sentence that contains either gold answers or plausible answers. We explore three different architectures, as shown in Figure 2: (1) a sequential model that takes the inputs as a long sequence, (2) an interactive model that encodes two sentences interdependently, and (3) a hybrid model that takes both approaches into account.

Model-I: Sequential Architecture. In Model-I, we convert the answer sentence and the question along with the extracted answer into an ordered input sequence. Then we adapt the recently proposed Generative Pre-trained Transformer (OpenAI GPT) (Radford et al. 2018) to perform the task. The model is a multi-layer Transformer decoder (Liu et al. 2018a), which is first trained with a language modeling objective on a large unlabeled text corpus and then fine-tuned on the specific target task.

Specifically, given an answer sentence $S$, a question $Q$ and an extracted answer $A$, we concatenate the two sentences with the answer while adding a delimiter token in between to get $[S; Q; \$; A]$. We then embed the sequence with its word embedding as well as position embedding. Multiple transformer blocks are used to encode the sequence embeddings as follows:

$$h_0 = W_e[X] + W_p$$

$$h_i = \text{transformer\_block}(h_{i-1}), \quad \forall i \in [1, n]$$

where $X$ denotes the sequence's indexes in the vocabulary, $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, and $n$ is the number of transformer blocks. Each block consists of a masked multi-head self-attention layer (Vaswani et al. 2017) and a position-wise feed-forward layer. Residual connection and layer normalization are used after each layer.

The last token's activation $h_n^{l_m}$ is then fed into a linear projection layer followed by a softmax function to output the no-answer probability $y$:

$$p(y \mid X) = \text{softmax}(h_n^{l_m} W_y)$$

A standard cross-entropy objective is used to minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{(X, y)} \log p(y \mid X)$$

Model-II: Interactive Architecture. In Model-II, we consider an interactive architecture that aims to capture the interactions between two sentences, so as to recognize their local entailment relationships for verifying the answer. This model consists of the following layers:

Encoding: We embed words using the GloVe embedding (Pennington, Socher, and Manning 2014), and also embed characters of each word with trainable vectors. We run a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber 1997) to encode the characters and concatenate the two last hidden states to get character-level embeddings. In addition, we use a binary feature to indicate if a word is part of the answer. All embeddings along with the feature are then concatenated and encoded by a weight-shared BiLSTM, yielding two series of contextual representations:

$$s_i = \text{BiLSTM}([\text{word}_i^s; \text{char}_i^s; \text{fea}_i^s]), \quad \forall i \in [1, l_s]$$

$$q_j = \text{BiLSTM}([\text{word}_j^q; \text{char}_j^q; \text{fea}_j^q]), \quad \forall j \in [1, l_q]$$

where $l_s$ is the length of the answer sentence, and $[\cdot; \cdot]$ denotes concatenation.

Inference Modeling: An inference modeling layer is used to capture the interactions between two sentences and produce two inference-aware sentence representations. We first compute the dot products of all tuples $\langle s_i, q_j \rangle$ as attention weights, and then normalize these weights so as to obtain attended vectors as follows:

$$a_{ij} = s_i^{\top} q_j, \quad \forall i \in [1, l_s], \ \forall j \in [1, l_q]$$

$$b_i = \sum_{j=1}^{l_q} \frac{e^{a_{ij}}}{\sum_{k=1}^{l_q} e^{a_{ik}}} q_j, \qquad c_j = \sum_{i=1}^{l_s} \frac{e^{a_{ij}}}{\sum_{k=1}^{l_s} e^{a_{kj}}} s_i$$

Here, $b_i$ refers to the attended vector from question $Q$ for the $i$-th word in answer sentence $S$, and vice versa for $c_j$.

Next, in order to separately compare the aligned pairs $\{(s_i, b_i)\}_{i=1}^{l_s}$ and $\{(q_j, c_j)\}_{j=1}^{l_q}$ for finding local inference information, we use a weight-shared function $F$ to model these aligned pairs as:

$$\tilde{s}_i = F(s_i, b_i), \qquad \tilde{q}_j = F(q_j, c_j)$$

$F$ can have various forms, such as BiLSTM, multilayer perceptron, and so on. Here we use a heuristic function $o = F(x, y)$ proposed by Hu et al. (2018a), which demonstrates good performance compared to other options:

$$r = \text{gelu}(W_r [x; y; x \circ y; x - y])$$

$$g = \sigma(W_g [x; y; x \circ y; x - y])$$

$$o = g \circ r + (1 - g) \circ x$$

where gelu is the Gaussian Error Linear Unit (Hendrycks and Gimpel 2016), $\circ$ is element-wise multiplication, and the bias term is omitted.

Intra-Sentence Modeling: Next we apply an intra-sentence modeling layer to capture self correlations inside each sentence. The inputs are the inference-aware vectors $\tilde{s}_i$ and $\tilde{q}_j$, which are first passed through another BiLSTM layer for encoding. We then use the same attention mechanism described above, only now between each sentence and itself, and we set $a_{ij} = -\infty$ if $i = j$ to ensure that the word is not aligned with itself. Another function $F$ is used to produce self-aware vectors $\hat{s}_i$ and $\hat{q}_j$ respectively.

Prediction: Before the final prediction, we apply a concatenated residual connection and model the sentences with a BiLSTM as:

$$\bar{s}_i = \text{BiLSTM}([\tilde{s}_i; \hat{s}_i]), \qquad \bar{q}_j = \text{BiLSTM}([\tilde{q}_j; \hat{q}_j])$$
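To make the attention and fusion steps of Model-II concrete, here is a small sketch in plain Python of the dot-product alignment and the gated function $F$ from the inference modeling layer, with an optional diagonal mask for the intra-sentence case. It is an illustration under assumptions, not the authors' code: vectors are plain lists, the helper names and the `mask_self` flag are invented for this example, the weight matrices are supplied by the caller, and the BiLSTM layers are left out entirely.

```python
import math


def softmax(xs):
    """Softmax over a list; entries equal to -inf receive zero weight."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def attend(S, Q, mask_self=False):
    """Return (B, C): b_i attends over Q for s_i, c_j attends over S for q_j."""
    A = [[dot(s, q) for q in Q] for s in S]            # a_ij = s_i^T q_j
    if mask_self:
        # Intra-sentence case (S and Q are the same sentence, length >= 2):
        # a word must not be aligned with itself.
        for i in range(len(A)):
            A[i][i] = -math.inf
    B = [weighted_sum(softmax(row), Q) for row in A]   # normalize over j
    columns = [[A[i][j] for i in range(len(S))] for j in range(len(Q))]
    C = [weighted_sum(softmax(col), S) for col in columns]  # normalize over i
    return B, C


def gelu(x):
    # Exact GELU expressed through the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(x, y, W_r, W_g):
    """o = g * r + (1 - g) * x over features [x; y; x*y; x-y], biases omitted."""
    feats = x + y + [a * b for a, b in zip(x, y)] + [a - b for a, b in zip(x, y)]
    r = [gelu(dot(row, feats)) for row in W_r]         # candidate vector
    g = [sigmoid(dot(row, feats)) for row in W_g]      # gate in (0, 1)
    return [g_k * r_k + (1.0 - g_k) * x_k for g_k, r_k, x_k in zip(g, r, x)]
```

For vectors of dimension d, `W_r` and `W_g` are d × 4d matrices given as lists of rows. The gate `g` interpolates between the newly computed vector `r` and the original input `x`, so the layer can pass a token through nearly unchanged when the aligned evidence is uninformative.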

A mean-max pooling operation is then applied to summarize the final representation of the two sentences, namely $\bar{s}_i$ and $\bar{q}_j$. All summarized vectors are then concatenated and fed into a feed-forward classifier that consists of a projection sublayer with gelu activation and a softmax output sublayer, yielding the no-answer probability. As before, we optimize the negative log-likelihood objective function.

Model-III: Hybrid Architecture. To explore how the features extracted by Model-I and Model-II can be integrated to obtain better representation capacities, we investigate the combination of the above two models, namely Model-III. We merge the output vectors of the two models into a single joint representation. A unified feed-forward classifier is then applied to output the no-answer probability. This design allows us to test whether the performance can benefit from the integration of two different architectures. In practice we use a simple concatenation to merge the two sources of information.

Experimental Setup

Dataset

We evaluate our approach on the SQuAD 2.0 dataset (Rajpurkar, Jia, and Liang 2018). SQuAD 2.0 is a new machine reading comprehension benchmark that aims to test whether models have truly understood the questions by knowing what they don't know. It combines answerable questions from the previous SQuAD 1.1 dataset (Rajpurkar et al. 2016) with 53,775 unanswerable questions about the same passages. Crowdsourcing workers craft these questions with a plausible answer in mind, and make sure that they are relevant to the corresponding passages.

Training and Inference

Our no-answer reader is trained on context passages, while the answer verifier is trained on oracle answer sentences. Model-I follows a procedure of unsupervised pre-training and supervised fine-tuning. That is, the model is first optimized with a language modeling objective on a large unlabeled text corpus to initialize its parameters. Then it adapts the parameters to the answer verifying task with our supervised objective. For Model-II, we directly train it with the supervised loss. Model-III, however, consists of two different architectures that require different training procedures. Therefore, we initialize Model-III with the pre-trained parameters from both Model-I and Model-II, and then fine-tune the whole model until convergence.

At test time, the reader first predicts a candidate answer as well as a passage-level no-answer probability. The answer verifier then validates the extracted answer along with its sentence and outputs a sentence-level probability. Following the official evaluation setting, a question is detected to be unanswerable once the joint no-answer probability, which is computed as the mean of the above two probabilities, exceeds some threshold. We tune this threshold to maximize the F1 score on the development set, and report both EM (Exact Match) and F1 metrics. We also evaluate the performance on no-answer detection with an accuracy metric (ACC), where the threshold is set to 0.5 by default.

Implementation

We use the Reinforced Mnemonic Reader (RMR) (Hu et al. 2018a), one of the state-of-the-art reading comprehension models on the SQuAD 1.1 dataset, as our base reader. The reader is configured with its default setting, and trained with the no-answer objective together with our auxiliary losses. The use of ELMo (Embeddings from Language Models) (Peters et al. 2018) is listed explicitly in each experimental configuration. We run a grid search on $\gamma$ and $\lambda$ among [0.1, 0.3, 0.5, 0.7, 1, 2]. Based on the performance on the development set, we set $\gamma$ to 0.3 and $\lambda$ to 1. As for answer verifiers, we use the original configuration from Radford et al. (2018) for Model-I. For Model-II, the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0008 is used, the hidden size is set to 300, and a dropout (Srivastava et al. 2014) of 0.3 is applied to prevent overfitting. The batch size is 48 for the reader, 64 for Model-II, and 32 for Model-I as well as Model-III. We use the GloVe (Pennington, Socher, and Manning 2014) 100D embeddings for the reader, and 300D embeddings for Model-II and Model-III. We utilize the nltk tokenizer to preprocess passages and questions, as well as to split sentences. The passages and the sentences are truncated so as not to exceed 300 words and 150 words respectively.

Evaluation

Main Results

| Model | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| BNA (1) | 59.8 | 62.6 | 59.2 | 62.1 |
| DocQA (2) | 61.9 | 64.8 | 59.3 | 62.3 |
| DocQA + ELMo | 65.1 | 67.6 | 63.4 | 66.3 |
| ARRR† | - | - | 68.6 | 71.1 |
| VS³-Net† | - | - | 68.4 | 71.3 |
| SAN (3) | - | - | 68.6 | 71.4 |
| FusionNet++ (ensemble) (4) | - | - | 70.3 | 72.6 |
| SLQA+ (5) | - | - | 71.5 | 74.4 |
| RMR + ELMo + Verifier | 72.3 | 74.8 | 71.7 | 74.2 |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |

Table 1: Comparison of different approaches on the SQuAD 2.0 test set, extracted on Aug 28, 2018: (1) Levy et al. (2017), (2) Clark and Gardner (2018), (3) Liu et al. (2018b), (4) Huang et al. (2018) and (5) Wang et al. (2018). † indicates unpublished works.

We first submitted our approach for evaluation on the hidden test set of SQuAD 2.0; the results are shown in Table 1. We use Model-III as the default answer verifier, and only report the best result. As we can see, our system obtains state-of-the-art results by achieving an EM score of 71.7 and an F1 score of 74.2 on the test set. Notice that SLQA+ has reached a result comparable to our approach. We argue that its promising result is largely due to its superior performance compared to our base reader (footnote 4).

Footnote 4: SLQA+ achieves 87.0 F1 on the SQuAD 1.1 test set, while RMR reaches 86.6.

Ablation Study

| Configuration | HasAns EM | HasAns F1 | All EM | All F1 | NoAns ACC |
|---|---|---|---|---|---|
| RMR | 72.6 | 81.6 | 66.9 | 69.1 | 73.1 |
| - indep-I | 71.3 | 80.4 | 66.0 | 68.6 | 72.8 |
| - indep-II | 72.4 | 81.4 | 64.0 | 66.1 | 69.8 |
| - both | 71.9 | 80.9 | 65.2 | 67.5 | 71.4 |
| RMR + ELMo | 79.4 | 86.8 | 71.4 | 73.7 | 77.0 |
| - indep-I | 78.9 | 86.5 | 71.2 | 73.5 | 76.7 |
| - indep-II | 79.5 | 86.6 | 69.4 | 71.4 | 75.1 |
| - both | 78.7 | 86.2 | 70.0 | 71.9 | 75.3 |

Table 2: Comparison of readers with different auxiliary losses.

| Configuration | NoAns ACC |
|---|---|
| Model-I | 74.5 |
| Model-II | 74.6 |
| Model-II + ELMo | 75.3 |
| Model-III | 76.2 |
| Model-III + ELMo | 76.1 |

Table 3: Comparison of different architectures for the answer verifier.

Next, we do an ablation study on the SQuAD 2.0 development set to show the effects of our proposed methods for each individual component. Table 2 first shows the ablation results of different auxiliary losses on the reader. Removing the independent span loss (indep-I) results in a performance drop for all answerable questions (HasAns), indicating that this loss helps the model better identify the answer boundary. Ablating the independent no-answer loss (indep-II), on the other hand, has little influence on HasAns, but leads to a severe decline in no-answer accuracy (NoAns ACC). This suggests that a conflict between answer extraction and no-answer detection indeed happens. Finally, deleting both losses causes a degradation of more than 1.5 points on the overall performance in terms of F1, with or without ELMo embeddings.

Table 3 details the results of various architectures for the answer verifier. Model-III outperforms all other competitors, achieving a no-answer accuracy of 76.2. This illustrates that the combination of two different architectures can bring further improvement. Adding ELMo embeddings to Model-III, however, does not improve performance (76.1 versus 76.2). We hypothesize that the byte-pair encoding (Sennrich, Haddow, and Birch 2016) from Model-I and the word/character embeddings from Model-II already provide enough representation capacity.

| Configuration | All EM | All F1 | NoAns ACC |
|---|---|---|---|
| RMR | 66.9 | 69.1 | 73.1 |
| + Model-I | 68.3 | 71.1 | 76.2 |
| + Model-II | 68.1 | 70.8 | 75.6 |
| + Model-II + ELMo | 68.2 | 70.9 | 75.9 |
| + Model-III | 68.5 | 71.5 | 77.1 |
| + Model-III + ELMo | 68.5 | 71.2 | 76.5 |
| RMR + ELMo | 71.4 | 73.7 | 77.0 |
| + Model-I | 71.8 | 74.4 | 77.3 |
| + Model-II | 71.8 | 74.2 | 78.1 |
| + Model-II + ELMo | 72.0 | 74.3 | 78.2 |
| + Model-III | 72.3 | 74.8 | 78.6 |
| + Model-III + ELMo | 71.8 | 74.3 | 78.3 |

Table 4: Comparison of readers with different answer verifiers.

| Configuration | All EM | All F1 | NoAns ACC |
|---|---|---|---|
| DocQA | 61.9 | 64.8 | 69.1 |
| + Model-III | 66.5 | 69.2 | 75.2 |
| DocQA + ELMo | 65.1 | 67.6 | 70.6 |
| + Model-III | 68.0 | 70.7 | 76.1 |

Table 5: Comparison of different readers with fixed answer verifier.

After doing separate ablations on each component, we then compare the performance of the whole system, as shown in Table 4. Combining the base reader with any answer verifier always results in considerable performance gains, and combining the reader with Model-III obtains the best result. We find that the improvement in no-answer accuracy is significant. This metric rises from 73.1 to 77.1 after adding Model-III to RMR, an increase of 4 absolute points. A similar observation can be made when ELMo embeddings are used, demonstrating that the gains are consistent and stable.

In order to investigate how the readers affect the overall performance, we fix the answer verifier as Model-III and use DocQA (Clark and Gardner 2018) as the base reader instead of RMR, as shown in Table 5. We find that the absolute improvements are even larger: the no-answer accuracy increases by roughly 6 points when adding Model-III to DocQA (from 69.1 to 75.2), and by 5.5 points when adding Model-III to DocQA + ELMo (from 70.6 to 76.1).

Finally, we plot the precision-recall curves of F1 score on the development set in Figure 3. We observe that RMR + ELMo + Verifier achieves the best precision when the recall is less than 80. After the recall exceeds 80, the precision of RMR + ELMo becomes slightly better. Ablating the two auxiliary losses, however, leads to an overall degradation of the curve, but it still outperforms the baseline by a large margin.

[Figure 3 plot: precision (vertical axis) against recall (horizontal axis) for DocQA + ELMo, RMR + ELMo - both, RMR + ELMo, and RMR + ELMo + Verifier.]

Figure 3: Precision-Recall curves of F1 score.

Error Analysis

To perform error analysis, we first categorize all examples on the development set into 5 classes:

• Case1: the question is answerable, the no-answer probability is less than the threshold, and the answer is correct.
• Case2: the question is unanswerable, and the no-answer probability is larger than the threshold.
• Case3: almost the same as case1, except that the predicted answer is wrong.
• Case4: the question is unanswerable, but the no-answer probability is less than the threshold.
• Case5: the question is answerable, but the no-answer probability is larger than the threshold.

| Configuration | Case1 ✓ | Case2 ✓ | Case3 ✗ | Case4 ✗ | Case5 ✗ |
|---|---|---|---|---|---|
| RMR - both | 27.8% | 37.3% | 6.5% | 12.7% | 15.7% |
| RMR | 27% | 39.9% | 5.9% | 10.2% | 17% |
| RMR + Verifier | 30.3% | 38.2% | 8.4% | 11.8% | 11.3% |
| RMR + ELMo - both | 31.5% | 38.3% | 5.6% | 11.8% | 12.8% |
| RMR + ELMo | 31.2% | 40.2% | 5.5% | 9.9% | 13.2% |
| RMR + ELMo + Verifier | 32.5% | 39.8% | 6.5% | 10.3% | 10.9% |

Table 6: Percentage of five categories. Correct predictions are denoted with ✓, while wrong cases are marked with ✗.

We then show the percentage of each category in Table 6. As we can see, the base reader trained with auxiliary losses is notably better at case2 and case4 compared to the baseline, implying that our proposed losses mainly help the model improve on unanswerable cases. After adding the answer verifier, we observe that although the system's performance on unanswerable cases slightly decreases, the results on case1 and case5 improve. This demonstrates that the answer verifier does better at detecting answerable questions than unanswerable ones. Besides, we find that the error of answer extraction is relatively small (6.5% for Case3 in RMR + ELMo + Verifier). However, the classification error on no-answer detection is much larger. More than 20% of examples are misclassified even with our best system (10.3% for Case4 and 10.9% for Case5 in RMR + ELMo + Verifier). Therefore, we argue that the main performance bottleneck lies in no-answer detection instead of answer extraction.

| Phenomenon | All (%) | Error (%) |
|---|---|---|
| Negation | 9% | 0% |
| Antonym | 20% | 8% |
| Entity Swap | 21% | 24% |
| Mutual Exclusion | 15% | 16% |
| Impossible Condition | 4% | 14% |
| Other Neutral | 24% | 32% |
| Answerable | 7% | 6% |

Table 7: Linguistic phenomena exhibited by all negative examples (statistics from Rajpurkar et al. (2018)) and sampled error cases of RMR + ELMo + Verifier.

Next, to understand the challenges our approach faces, we manually investigate 50 incorrectly predicted unanswerable examples (based on F1) that are randomly sampled from the development set. Following the types of negative examples defined by Rajpurkar et al. (2018), we categorize the sampled examples and show them in Table 7. As we can see, our system is good at recognizing negation and antonyms. The frequency of negation decreases from 9% to 0%, and only 4 antonym examples are predicted wrongly. We think that this is because these two types are relatively easy to identify. Both negation and antonymy only require detecting a single word in the question, such as "never" or "not" for negation and "increase" changed to "decrease" for antonymy. However, the impossible condition and other neutral types together account for roughly 46% of the error set, indicating that our system performs less effectively on these more difficult cases.

Related Work

Reading Comprehension Datasets. Various large-scale reading comprehension datasets, such as cloze-style test (Hermann et al. 2015), answer extraction benchmark (Rajpurkar et al. 2016; Joshi et al. 2017) and answer generation benchmark (Nguyen et al. 2016; Kočiský et al. 2018), have been proposed. However, these datasets still guarantee that the given context must contain an answer. Recently, some works construct negative examples by retrieving passages for existing questions based on Lucene (Tan et al. 2018) and TF-IDF (Clark and Gardner 2018), or by using crowdworkers to craft unanswerable questions (Rajpurkar, Jia, and Liang 2018). Compared to automatically retrieved negative examples, human-annotated examples are more difficult to detect for two reasons: (1) the questions are relevant to the passage and (2) the passage contains a plausible answer to the question. Therefore, we choose to work on the

8.SQuAD 2.0 dataset in this paper. Acknowledgments Neural Networks for Reading Comprehension. Neural We would like to thank Pranav Rajpurkar and Robin Jia for reading models typically leverage various attention mecha- their helps with SQuAD 2.0 submissions. This work is sup- nisms to build interdependent representations of passage and ported by the Major State Research Development Program question, and sequentially predict the answer boundary (Seo (2016YFB0201305). et al. 2017; Hu et al. 2018a; Wang et al. 2017; Yu et al. 2018; Hu et al. 2018b). However, these approaches are not de- References signed to handle no-answer cases. To address this problem, Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. previous works (Levy et al. 2017; Clark and Gardner 2018; 2015. A large annotated corpus for learning natural language Kundu and Ng 2018) predict a no-answer probability in ad- inference. In Proceedings of EMNLP. dition to the distribution over answer spans, so as to jointly Bowman, S. R.; Gauthier, J.; Rastogi, A.; Gupta, R.; Man- learn no-answer detection as well as answer extraction. Our ning, C. D.; and Potts, C. 2016. A fast unified model no-answer reader extends existing approaches by introduc- for parsing and sentence understanding. arXiv preprint ing two auxiliary losses that enhance these two tasks inde- arXiv:1603.06021. pendently. Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; Jiang, H.; and Inkpen, Recognizing Textual Entailment. Recognizing textual en- D. 2016. Enhanced lstm for natural language inference. tailment (RTE) (Dagan et al. 2010; Marelli et al. 2014), or arXiv preprint arXiv:1609.06038. known as natural language inference (NLI) (Bowman et al. 2015), requires systems to understand entailment, contra- Clark, C., and Gardner, M. 2018. Simple and effective multi- diction or semantic neutrality between two sentences. This paragraph reading comprehension. In Proceedings of ACL. 
task is strongly related to no-answer detection, where the Dagan, I.; Dolan, B.; Magnini, B.; and Roth, D. 2010. machine needs to understand if the passage and the ques- Recognizing textual entailment: rational, evaluation and ap- tion supports the answer. To recognize entailment, various proaches. Natural Language Engineering 16(1):105–105. branches of works have been proposed, including encoding- Hendrycks, D., and Gimpel, K. 2016. Bridging nonlinear- based approach (Bowman et al. 2016; Mou et al. 2015), ities and stochastic regularizers with gaussian error linear interaction-based approach (Parikh et al. 2016; Chen et al. units. arXiv preprint arXiv:1606.08415. 2016) and sequence-based approach (Radford et al. 2018). Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; In this paper we investigate the last two branches and further Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching propose a hybrid architecture that combines both of them machines to read and comprehend. In Proceedings of NIPS. properly. Hochreiter, S., and Schmidhuber, J. 1997. Long short-term Answer Validation. Early answer validation task (Magnini memory. Neural computation 9(8):1735–1780. et al. 2002) aims at ranking multiple candidate answers to return a most reliable one. Later, the answer validation exer- Hu, M.; Peng, Y.; Huang, Z.; Qiu, X.; Wei, F.; and Zhou, M. cise (Rodrigo, Pe˜nas, and Verdejo 2008) has been proposed 2018a. Reinforced mnemonic reader for machine reading to decide whether an answer is correct or not according to a comprehension. In Proceedings of IJCAI. given supporting text and a question, but the dataset is too Hu, M.; Peng, Y.; Wei, F.; Huang, Z.; DongshengLi; Yang, small for neural network-based approaches. Recently, Tan et N.; and Zhou, M. 2018b. Attention-guided answer distilla- al. (2018) propose to validate the candidate answer for de- tion for machine reading comprehension. 
In Proceedings of tecting unanswerable questions, by comparing the question EMNLP. with the passage. Our answer verifier, on the contrary, de- Huang, H.-Y.; Zhu, C.; Shen, Y.; and Chen, W. 2018. Fu- noises the passage by comparing questions with answer sen- sionnet: fusing via fully-aware attention with application to tences, so as to focus on finding local entailment that sup- machine comprehension. In Proceedings of ICLR. ports the answer. Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL. Conclusion Kingma, D. P., and Ba, L. J. 2014. Adam: A method for stochastic optimization. In CoRR, abs/1412.6980. We proposed a read-then-verify system that is able to ab- Koˇcisk`y, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, stain from answering when a question has no answer given K. M.; Melis, G.; and Grefenstette, E. 2018. The narra- the passage. We first introduce two auxiliary losses to help tiveqa reading comprehension challenge. Transactions of the reader concentrate on answer extraction and no-answer ACL 6:317–328. detection respectively, and then utilize an answer verifier to validate the legitimacy of the predicted answer, in which Kundu, S., and Ng, H. T. 2018. A nil-aware answer extrac- three different architectures are investigated. Our system has tion framework for question answering. In Proceedings of achieved state-of-the-art results on the SQuAD 2.0 dataset at EMNLP, 4243–4252. the time of submission (Aug. 28th, 2018). Looking forward, Levy, O.; Seo, M.; Choi, E.; and Zettlemoyer, L. 2017. Zero- we plan to design new structures for answer verifiers to han- shot relation extraction via reading comprehension. arXiv dle questions with more complicated inferences. preprint arXiv:1706.04115.

Liu, P. J.; Saleh, M.; Pot, E.; Goodrich, B.; Sepassi, R.; Kaiser, L.; and Shazeer, N. 2018a. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

Liu, X.; Shen, Y.; Duh, K.; and Gao, J. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of ACL.

Magnini, B.; Negri, M.; Prevete, R.; and Tanev, H. 2002. Is it the right answer? exploiting web redundancy for answer validation. In Proceedings of ACL.

Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R.; et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC, 216–223.

Mou, L.; Men, R.; Li, G.; Xu, Y.; Zhang, L.; Yan, R.; and Jin, Z. 2015. Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422.

Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. Ms marco: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: unanswerable questions for squad. In Proceedings of ACL.

Rodrigo, Á.; Peñas, A.; and Verdejo, F. 2008. Overview of the answer validation exercise 2008. In Workshop of CLEF, 296–313. Springer.

Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL.

Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 1929–1958.

Tan, C.; Wei, F.; Zhou, Q.; Yang, N.; Lv, W.; and Zhou, M. 2018. I know there is no answer: modeling answer validation for machine reading comprehension. In Proceedings of NLPCC, 85–97. Springer.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of NIPS, 5998–6008.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Proceedings of NIPS.

Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL.

Wang, W.; Yan, M.; and Wu, C. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of ACL.

Yu, A. W.; Dohan, D.; Luong, M.-T.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. Qanet: combining local convolution with global self-attention for reading comprehension. In Proceedings of ICLR.