Dynamic Memory Networks for Natural Language Processing

Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook’s bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.

Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher
firstname@metamind.io, MetaMind, Palo Alto, CA USA
arXiv:1506.07285v5 [cs.CL] 5 Mar 2016
Copyright 2016 by the author(s).

I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
A: garden

I: It started boring, but then it got interesting.
Q: What's the sentiment?
A: positive
Q: POS tags?
A: PRP VBD JJ , CC RB PRP VBD JJ .

Figure 1. Example inputs and questions, together with answers generated by a dynamic memory network trained on the corresponding task. In sequence modeling tasks, an answer mechanism is triggered at each input word instead of only at the end.

1. Introduction

Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem: high level tasks like machine translation (What is the translation into French?); sequence modeling tasks like named entity recognition (Passos et al., 2014) (NER) (What are the named entity tags in this sentence?) or part-of-speech tagging (POS) (What are the part-of-speech tags?); classification problems like sentiment analysis (Socher et al., 2013) (What is the sentiment?); even multi-sentence joint classification problems like coreference resolution (Who does "their" refer to?).

We propose the Dynamic Memory Network (DMN), a neural network based framework for general question answering tasks that is trained using raw input-question-answer triplets. Generally, it can solve sequence tagging tasks, classification problems, sequence-to-sequence tasks and question answering tasks that require transitive reasoning. The DMN first computes a representation for all inputs and the question. The question representation then triggers an iterative attention process that searches the inputs and retrieves relevant facts. The DMN memory module then reasons over retrieved facts and provides a vector representation of all relevant information to an answer module which generates the answer.

Fig. 1 provides examples of inputs, questions and answers for tasks that are evaluated in this paper and for which a DMN achieves a new level of state-of-the-art performance.

2. Dynamic Memory Networks

We now give an overview of the modules that make up the DMN. We then examine each module in detail and give intuitions about its formulation. A high-level illustration of the DMN is shown in Fig. 2.

Input Module: The input module encodes raw text inputs from the task into distributed vector representations. In this paper, we focus on natural language related problems. In these cases, the input may be a sentence, a long story, a movie review, a news article, or several Wikipedia articles.

Question Module: Like the input module, the question module encodes the question of the task into a distributed vector representation. For example, in the case of question answering, the question may be a sentence such as "Where did the author first fly?". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.

Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a "memory" vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words, the module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.

Answer Module: The answer module generates an answer from the final memory vector of the memory module.

Figure 2. Overview of DMN modules. Communication between them is indicated by arrows and uses vector representations. Questions trigger gates which allow vectors for certain inputs to be given to the episodic memory module. The final state of the episodic memory is the input to the answer module.

A detailed visualization of these modules is shown in Fig. 3.

2.1. Input Module

In natural language processing problems, the input is a sequence of $T_I$ words $w_1, \ldots, w_{T_I}$. One way to encode the input sequence is via a recurrent neural network (Elman, 1991). Word embeddings are given as inputs to the recurrent network. At each time step $t$, the network updates its hidden state $h_t = RNN(L[w_t], h_{t-1})$, where $L$ is the embedding matrix and $w_t$ is the word index of the $t$-th word of the input sequence.

In cases where the input sequence is a single sentence, the input module outputs the hidden states of the recurrent network. In cases where the input sequence is a list of sentences, we concatenate the sentences into a long list of word tokens, inserting after each sentence an end-of-sentence token. The hidden states at each of the end-of-sentence tokens are then the final representations of the input module. In subsequent sections, we denote the output of the input module as the sequence of $T_C$ fact representations $c$, whereby $c_t$ denotes the $t$-th element in the output sequence of the input module. Note that in the case where the input is a single sentence, $T_C = T_I$. That is, the number of output representations is equal to the number of words in the sentence. In the case where the input is a list of sentences, $T_C$ is equal to the number of sentences.

Choice of recurrent network: In our experiments, we use a gated recurrent network (GRU) (Cho et al., 2014a; Chung et al., 2014). We also explored the more complex LSTM (Hochreiter & Schmidhuber, 1997) but it performed similarly and is more computationally expensive. Both work much better than the standard tanh RNN and we postulate that the main strength comes from having gates that allow the model to suffer less from the vanishing gradient problem (Hochreiter & Schmidhuber, 1997). Assume each time step $t$ has an input $x_t$ and a hidden state $h_t$. The internal mechanics of the GRU are defined as:

$$z_t = \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)}\right) \tag{1}$$
$$r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)}\right) \tag{2}$$
$$\tilde{h}_t = \tanh\left(W x_t + r_t \circ U h_{t-1} + b^{(h)}\right) \tag{3}$$
$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t \tag{4}$$

where $\circ$ is an element-wise product, $W^{(z)}, W^{(r)}, W \in \mathbb{R}^{n_H \times n_I}$ and $U^{(z)}, U^{(r)}, U \in \mathbb{R}^{n_H \times n_H}$. The dimensions $n$ are hyperparameters. We abbreviate the above computation with $h_t = GRU(x_t, h_{t-1})$.

2.2. Question Module

Similar to the input sequence, the question is also most commonly given as a sequence of words in natural language processing problems. As before, we encode the question via a recurrent neural network.
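The GRU update of Eqs. 1-4 and the input module's selection of end-of-sentence states can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the parameter layout, the helper names `init_gru`, `gru_step` and `encode`, the toy dimensions, the random initialisation and the end-of-sentence index `EOS` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(n_in, n_hid, rng):
    """Toy random parameters for Eqs. 1-4 (hypothetical layout)."""
    p = {}
    for g in ("z", "r", "h"):
        p["W" + g] = rng.normal(0.0, 0.1, (n_hid, n_in))
        p["U" + g] = rng.normal(0.0, 0.1, (n_hid, n_hid))
        p["b" + g] = np.zeros(n_hid)
    return p

def gru_step(p, x_t, h_prev):
    """One GRU update h_t = GRU(x_t, h_{t-1})."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])              # Eq. 1
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])              # Eq. 2
    h_tilde = np.tanh(p["Wh"] @ x_t + r * (p["Uh"] @ h_prev) + p["bh"])  # Eq. 3
    return z * h_prev + (1.0 - z) * h_tilde                              # Eq. 4

def encode(p, L, word_ids):
    """Run the GRU over embedded words; return every hidden state."""
    h = np.zeros(p["bz"].shape[0])
    states = []
    for w in word_ids:
        h = gru_step(p, L[w], h)  # L[w] is the embedding of word index w
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
vocab, n_in, n_hid = 10, 4, 3          # assumed toy sizes
L = rng.normal(0.0, 0.1, (vocab, n_in))
params = init_gru(n_in, n_hid, rng)

EOS = 9                                # assumed end-of-sentence word index
story = [1, 2, 3, EOS, 4, 5, EOS]      # two sentences joined into one token list
states = encode(params, L, story)
eos_positions = [t for t, w in enumerate(story) if w == EOS]
facts = states[eos_positions]          # one fact representation per sentence
```

Here `facts` has one row per sentence, which corresponds to the sequence of $T_C$ fact representations used by the later modules; for a single-sentence input, `states` itself would play that role.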

Given a question of $T_Q$ words, the hidden state of the question encoder at time $t$ is given by $q_t = GRU(L[w_t^Q], q_{t-1})$, where $L$ represents the word embedding matrix as in the previous section and $w_t^Q$ represents the word index of the $t$-th word in the question. We share the word embedding matrix across the input module and the question module. Unlike the input module, the question module produces as output the final hidden state of the recurrent network encoder: $q = q_{T_Q}$.

2.3. Episodic Memory Module

The episodic memory module iterates over representations outputted by the input module, while updating its internal episodic memory. In its general form, the episodic memory module comprises an attention mechanism as well as a recurrent network with which it updates its memory. During each iteration, the attention mechanism attends over the fact representations $c$ while taking into consideration the question representation $q$ and the previous memory $m^{i-1}$ to produce an episode $e^i$.

The episode is then used, alongside the previous memory $m^{i-1}$, to update the episodic memory $m^i = GRU(e^i, m^{i-1})$. The initial state of this GRU is initialized to the question vector itself: $m^0 = q$. For some tasks, it is beneficial for the episodic memory module to take multiple passes over the input. After $T_M$ passes, the final memory $m^{T_M}$ is given to the answer module.

Need for Multiple Episodes: The iterative nature of this module allows it to attend to different inputs during each pass. It also allows for a type of transitive inference, since the first pass may uncover the need to retrieve additional facts. For instance, in the example in Fig. 3, we are asked "Where is the football?" In the first iteration, the model ought to attend to sentence 7 (John put down the football.), as the question asks about the football. Only once the model sees that John is relevant can it reason that the second iteration should retrieve where John was. Similarly, a second pass may help for sentiment analysis as we show in the experiments section below.

Figure 3. Real example of an input list of sentences and the attention gates that are triggered by a specific question from the bAbI tasks (Weston et al., 2015a). Gate values $g_t^i$ are shown above the corresponding vectors. The gates change with each search over inputs. We do not draw connections for gates that are close to zero. Note that the second iteration has wrongly placed some weight in sentence 2, which makes some intuitive sense, as sentence 2 is another place John had been.

Attention Mechanism: In our work, we use a gating function as our attention mechanism. For each pass $i$, the mechanism takes as input a candidate fact $c_t$, a previous memory $m^{i-1}$, and the question $q$ to compute a gate: $g_t^i = G(c_t, m^{i-1}, q)$.

The scoring function $G$ takes as input the feature set $z(c, m, q)$ and produces a scalar score. We first define a large feature vector that captures a variety of similarities between input, memory and question vectors:

$$z(c, m, q) = \left[\, c,\; m,\; q,\; c \circ q,\; c \circ m,\; |c - q|,\; |c - m|,\; c^T W^{(b)} q,\; c^T W^{(b)} m \,\right] \tag{5}$$

where $\circ$ is the element-wise product. The function $G$ is a simple two-layer feed forward neural network (Eq. 6).
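The feature vector of Eq. 5 and a two-layer gate on top of it can be sketched as follows. This is a hedged illustration, not the authors' code: the function names `features` and `gate`, the parameter dictionary, the toy sizes and the random values are assumptions, while the tanh hidden layer and sigmoid output follow the form of Eq. 6.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def features(c, m, q, Wb):
    """Feature vector z(c, m, q) of Eq. 5; length 7n + 2 for n-dimensional inputs."""
    return np.concatenate([
        c, m, q,                         # the raw vectors
        c * q, c * m,                    # element-wise products
        np.abs(c - q), np.abs(c - m),    # element-wise absolute differences
        [c @ Wb @ q], [c @ Wb @ m],      # bilinear similarities (scalars)
    ])

def gate(c, m, q, p):
    """Scalar gate in (0, 1) from a two-layer network over z(c, m, q)."""
    z = features(c, m, q, p["Wb"])
    hidden = np.tanh(p["W1"] @ z + p["b1"])
    return float(sigmoid(p["W2"] @ hidden + p["b2"]))

rng = np.random.default_rng(0)
n, n_hid = 4, 8                          # assumed toy sizes
params = {
    "Wb": rng.normal(0.0, 0.1, (n, n)),
    "W1": rng.normal(0.0, 0.1, (n_hid, 7 * n + 2)),
    "b1": np.zeros(n_hid),
    "W2": rng.normal(0.0, 0.1, n_hid),
    "b2": 0.0,
}

facts = rng.normal(size=(5, n))          # stand-ins for fact representations c_t
q = rng.normal(size=n)                   # stand-in for the question vector
m = q.copy()                             # first pass: m^0 = q
gates = [gate(c, m, q, params) for c in facts]
```

On later passes only `m` changes, so the same `gate` function yields different values per fact, which is what lets the module attend to different inputs on each pass.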

$$G(c, m, q) = \sigma\left(W^{(2)} \tanh\left(W^{(1)} z(c, m, q) + b^{(1)}\right) + b^{(2)}\right) \tag{6}$$

Some datasets, such as Facebook's bAbI dataset, specify which facts are important for a given question. In those cases, the attention mechanism of the $G$ function can be trained in a supervised fashion with a standard cross-entropy cost function.

Memory Update Mechanism: To compute the episode for pass $i$, we employ a modified GRU over the sequence of the inputs $c_1, \ldots, c_{T_C}$, weighted by the gates $g^i$. The episode vector that is given to the answer module is the final state of the GRU. The equation to update the hidden states of the GRU at time $t$ and the equation to compute the episode are, respectively:

$$h_t^i = g_t^i \, GRU(c_t, h_{t-1}^i) + (1 - g_t^i)\, h_{t-1}^i \tag{7}$$
$$e^i = h_{T_C}^i \tag{8}$$

Criteria for Stopping: The episodic memory module also has a signal to stop iterating over inputs. To achieve this, we append a special end-of-passes representation to the input, and stop the iterative attention process if this representation is chosen by the gate function. For datasets without explicit supervision, we set a maximum number of iterations. The whole module is end-to-end differentiable.

2.4. Answer Module

The answer module generates an answer given a vector. Depending on the type of task, the answer module is either triggered once at the end of the episodic memory or at each time step.

We employ another GRU whose initial state is initialized to the last memory $a_0 = m^{T_M}$. At each time step, it takes as input the question $q$, the last hidden state $a_{t-1}$, as well as the previously predicted output $y_{t-1}$:

$$y_t = \mathrm{softmax}\left(W^{(a)} a_t\right) \tag{9}$$
$$a_t = GRU\left([y_{t-1}, q],\, a_{t-1}\right) \tag{10}$$

where we concatenate the last generated word and the question vector as the input at each time step. The output is trained with the cross-entropy error classification of the correct sequence appended with a special end-of-sequence token.

In the sequence modeling task, we wish to label each word in the original sequence. To this end, the DMN is run in the same way as above over the input words. For word $t$, we replace Eq. 8 with $e^i = h_t^i$. Note that the gates for the first pass will be the same for each word, as the question is the same. This allows for speed-up in implementation by computing these gates only once. However, gates for subsequent passes will be different, as the episodes are different.

2.5. Training

Training is cast as a supervised classification problem to minimize cross-entropy error of the answer sequence. For datasets with gate supervision, such as bAbI, we add the cross-entropy error of the gates into the overall cost. Because all modules communicate over vector representations and various types of differentiable and deep neural networks with gates, the entire DMN model can be trained via backpropagation and gradient descent.

3. Related Work

Given the many shoulders on which this paper is standing and the many applications to which our model is applied, it is impossible to do related fields justice.

Deep Learning: There are several deep learning models that have been applied to many different tasks in NLP. For instance, recursive neural networks have been used for parsing (Socher et al., 2011), sentiment analysis (Socher et al., 2013), paraphrase detection (Socher et al., 2011), question answering (Iyyer et al., 2014) and logical inference (Bowman et al., 2014), among other tasks. However, because they lack the memory and question modules, a single model cannot solve as many varied tasks, nor tasks that require transitive reasoning over multiple sentences. Another commonly used model is the chain-structured recurrent neural network of the kind we employ above. Recurrent neural networks have been successfully used in language modeling (Mikolov & Zweig, 2012), speech recognition, and sentence generation from images (Karpathy & Fei-Fei, 2015). Also relevant is the sequence-to-sequence model used for machine translation by Sutskever et al. (2014). This model uses two extremely large and deep LSTMs to encode a sentence in one language and then decode the sentence in another language. This sequence-to-sequence model is a special case of the DMN without a question and without episodic memory. Instead it maps an input sequence directly to an answer sequence.

Attention and Memory: The second line of work that is very relevant to DMNs is that of attention and memory in deep learning. Attention mechanisms are generally useful and can improve image classification (Stollenga & J. Masci, 2014), automatic image captioning (Xu et al., 2015) and machine translation (Cho et al., 2014b; Bahdanau et al., 2014). Neural Turing machines use memory to solve algorithmic problems such as list sorting (Graves et al., 2014). The work of recent months by Weston et al. on memory networks (Weston et al., 2015b) focuses on adding a memory component for natural language question answering. They have an input (I) and response (R) component and their generalization (G) and output feature map (O) components have some functional overlap with our episodic memory. However, the Memory Network cannot be applied to the same variety of NLP tasks since it processes sentences independently and not via a sequence model. It requires bag of n-gram vector features as well as a separate feature that captures whether a sentence came before another one.

Various other neural memory or attention architectures have recently been proposed for algorithmic problems (Joulin & Mikolov, 2015; Kaiser & Sutskever, 2015), caption generation for images (Malinowski & Fritz, 2014; Chen & Zitnick, 2014), visual question answering (Yang et al., 2015) or other NLP problems and datasets (Hermann et al., 2015).

In contrast, the DMN employs neural sequence models for input representation, attention, and response mechanisms, thereby naturally capturing position and temporality. As a result, the DMN is directly applicable to a broader range of applications without feature engineering. We compare directly to Memory Networks on the bAbI dataset (Weston et al., 2015a).

NLP Applications: The DMN is a general model which we apply to several NLP problems. We compare to what, to the best of our knowledge, is the current state-of-the-art method for each task.

There are many different approaches to question answering: some build large knowledge bases (KBs) with open information extraction systems (Yates et al., 2007), some use neural networks, dependency trees and KBs (Bordes et al., 2012), others only sentences (Iyyer et al., 2014). A lot of other approaches exist. When QA systems do not produce the right answer, it is often unclear if it is because they do not have access to the facts, cannot reason over them or have never seen this type of question or phenomenon. Most QA datasets have only a few hundred questions and answers but require complex reasoning. They can hence not be solved by models that have to learn purely from examples. While synthetic datasets (Weston et al., 2015a) have problems and can often be solved easily with manual feature engineering, they let us disentangle failure modes of models and understand necessary QA capabilities. They are useful for analyzing models that attempt to learn everything and do not rely on external features like coreference, POS, parsing, logical rules, etc. The DMN is such a model. Another related model by Andreas et al. (2016) combines neural and logical reasoning for question answering over knowledge bases and visual question answering.

Sentiment analysis is a very useful classification task and recently the Stanford Sentiment Treebank (Socher et al., 2013) has become a standard benchmark dataset. Kim (2014) reports the previous state-of-the-art result based on a convolutional neural network that uses multiple word vector representations. The previous best model for part-of-speech tagging on the Wall Street Journal section of the Penn Tree Bank (Marcus et al., 1993) was that of Søgaard (2011), who used a semisupervised nearest neighbor approach. We also directly compare to paragraph vectors (Le & Mikolov, 2014).

Neuroscience: The episodic memory in humans stores specific experiences in their spatial and temporal context. For instance, it might contain the first memory somebody has of flying a hang glider. Eichenbaum and Cohen have argued that episodic memories represent a form of relationship (i.e., relations between spatial, sensory and temporal information) and that the hippocampus is responsible for general relational learning (Eichenbaum & Cohen, 2004). Interestingly, it also appears that the hippocampus is active during transitive inference (Heckers et al., 2004), and disruption of the hippocampus impairs this ability (Dusek & Eichenbaum, 1997).

The episodic memory module in the DMN is inspired by these findings. It retrieves specific temporal states that are related to or triggered by a question. Furthermore, we found that the GRU in this module was able to do some transitive inference over the simple facts in the bAbI dataset. This module also has similarities to the Temporal Context Model (Howard & Kahana, 2002) and its Bayesian extensions (Socher et al., 2009) which were developed to analyze human behavior in word recall experiments.

4. Experiments

We include experiments on question answering, part-of-speech tagging, and sentiment analysis. The model is trained independently for each problem, while the architecture remains the same except for the answer module and input fact subsampling (words vs sentences). The answer module, as described in Section 2.4, is triggered either once at the end or for each token.

For all datasets we used either the official train, development, test splits or, if no development set was defined, 10% of the training set for development. Hyperparameter tuning and model selection (with early stopping) are done on the development set. The DMN is trained via backpropagation and Adam (Kingma & Ba, 2014). We employ L2 regularization, and dropout on the word embeddings. Word vectors are pre-trained using GloVe (Pennington et al., 2014).

6. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing Task MemNN DMN Task Binary Fine-grained 1: Single Supporting Fact 100 100 MV-RNN 82.9 44.4 2: Two Supporting Facts 100 98.2 RNTN 85.4 45.7 3: Three Supporting Facts 100 95.2 DCNN 86.8 48.5 4: Two Argument Relations 100 100 PVec 87.8 48.7 5: Three Argument Relations 98 99.3 CNN-MC 88.1 47.4 6: Yes/No Questions 100 100 DRNN 86.6 49.8 7: Counting 85 96.9 CT-LSTM 88.0 51.0 8: Lists/Sets 91 96.5 DMN 88.6 52.1 9: Simple Negation 100 100 10: Indefinite Knowledge 98 97.5 Table 2. Test accuracies for sentiment analysis on the Stanford 11: Basic Coreference 100 99.9 Sentiment Treebank. MV-RNN and RNTN: Socher et al. (2013). 12: Conjunction 100 100 DCNN: Kalchbrenner et al. (2014). PVec: Le & Mikolov. (2014). 13: Compound Coreference 100 99.8 CNN-MC: Kim (2014). DRNN: Irsoy & Cardie (2015), 2014. 14: Time Reasoning 99 100 CT-LSTM: Tai et al. (2015) 15: Basic Deduction 100 100 16: Basic Induction 100 99.4 17: Positional Reasoning 65 59.6 We list results in Table 1. The DMN does worse than 18: Size Reasoning 95 95.3 the Memory Network, which we refer to from here on as 19: Path Finding 36 34.5 MemNN, on tasks 2 and 3, both tasks with long input se- 20: Agent’s Motivations 100 100 quences. We suspect that this is due to the recurrent input sequence model having trouble modeling very long inputs. Mean Accuracy (%) 93.3 93.6 The MemNN does not suffer from this problem as it views each sentence separately. The power of the episodic mem- Table 1. Test accuracies on the bAbI dataset. MemNN numbers taken from Weston et al. (Weston et al., 2015a). The DMN passes ory module is evident in tasks 7 and 8, where the DMN (accuracy > 95%) 18 tasks, whereas the MemNN passes 16. significantly outperforms the MemNN. Both tasks require the model to iteratively retrieve facts and store them in a representation that slowly incorporates more of the rele- vant information of the input sequence. Both models do 4.1. 
Question Answering poorly on tasks 17 and 19, though the MemNN does better. The Facebook bAbI dataset is a synthetic dataset for test- We suspect this is due to the MemNN using n-gram vectors ing a model’s ability to retrieve facts and reason over them. and sequence position features. Each task tests a different skill that a question answering model ought to have, such as coreference resolution, de- 4.2. Text Classification: Sentiment Analysis duction, and induction. Showing an ability exists here is The Stanford Sentiment Treebank (SST) (Socher et al., not sufficient to conclude a model would also exhibit it on 2013) is a popular dataset for sentiment classification. It real world text data. It is, however, a necessary condition. provides phrase-level fine-grained labels, and comes with a Training on the bAbI dataset uses the following objective train/development/test split. We present results on two for- function: J = αECE (Gates) + βECE (Answers), where mats: fine-grained root prediction, where all full sentences ECE is the standard cross-entropy cost and α and β are hy- (root nodes) of the test set are to be classified as either very perparameters. In practice, we begin training with α set to negative, negative, neutral, positive, or very positive, and 1 and β set to 0, and then later switch β to 1 while keep- binary root prediction, where all non-neutral full sentences ing α at 1. As described in Section 2.1, the input module of the test set are to be classified as either positive or neg- outputs fact representations by taking the encoder hidden ative. To train the model, we use all full sentences as well states at time steps corresponding to the end-of-sentence to- as subsample 50% of phrase-level labels every epoch. Dur- kens. The gate supervision aims to select one sentence per ing evaluation, the model is only evaluated on the full sen- pass; thus, we also experimented with modifying Eq. 8 to tences (root setup). 
In binary classification, neutral phrases a simple softmax instead of a GRU. Here, we compute the are removed from the dataset. The DMN achieves state-of- T final episode vector via: ei = t=1 softmax(gti )ct , where the-art accuracy on the binary classification task, as well as exp(gti ) on the fine-grained classification task. softmax(gti ) = T exp(gji ) , and gti here is the value of j=1 the gate before the sigmoid. This setting achieves better re- In all experiments, the DMN was trained with GRU se- sults, likely because the softmax encourages sparsity and is quence models. It is easy to replace the GRU sequence better suited to picking one sentence at a time. model with any of the models listed above, as well as in-

7. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing Model Acc (%) Max task 3 task 7 task 8 sentiment passes three-facts count lists/sets (fine grain) SVMTool 97.15 Sogaard 97.27 0 pass 0 48.8 33.6 50.0 Suzuki et al. 97.40 1 pass 0 48.8 54.0 51.5 Spoustova et al. 97.44 2 pass 16.7 49.1 55.6 52.1 SCNN 97.50 3 pass 64.7 83.4 83.4 50.1 5 pass 95.2 96.9 96.5 N/A DMN 97.56 Table 3. Test accuracies on WSJ-PTB Table 4. Effectiveness of episodic memory module across tasks. Each row shows the final accuracy in term of percentages with a different maximum limit for the number of passes the episodic corporate tree structure in the retrieval process. memory module can take. Note that for the 0-pass DMN, the network essential reduces to the output of the attention module. 4.3. Sequence Tagging: Part-of-Speech Tagging hard examples with mixed positive/negative vocabulary. Part-of-speech tagging is traditionally modeled as a se- quence tagging problem: every word in a sentence is to 4.5. Qualitative Analysis of Episodic Memory Module be classified into its part-of-speech class (see Fig. 1). We evaluate on the standard Wall Street Journal dataset (Mar- Apart from a quantitative analysis, we also show qualita- cus et al., 1993). We use the standard splits of sections tively what happens to the attention during multiple passes. 0-18 for training, 19-21 for development and 22-24 for test We present specific examples from the experiments to illus- sets (Søgaard, 2011). Since this is a word level tagging trate that the iterative nature of the episodic memory mod- task, DMN memories are classified at each time step corre- ule enables the model to focus on relevant parts of the input. sponding to each word. This is described in detail in Sec- For instance, Table 5 shows an example of what the DMN tion 2.4’s discussion of sequence modeling. focuses on during each pass of a three-iteration scan on a question from the bAbI dataset. 
We compare the DMN with the results in (Søgaard, 2011). The DMN achieves state-of-the-art accuracy with a single model, reaching a development set accuracy of 97.5. Ensembling the top 4 development models, the DMN reaches 97.58 dev and 97.56 test accuracy, a slightly higher new state of the art (Table 3).

4.4. Quantitative Analysis of Episodic Memory Module

The main novelty of the DMN architecture is in its episodic memory module. Hence, we analyze how important the episodic memory module is for NLP tasks and in particular how the number of passes over the input affects accuracy. Table 4 shows the accuracies on a subset of bAbI tasks as well as on the Stanford Sentiment Treebank. We note that for several of the hard reasoning tasks, multiple passes over the inputs are crucial to achieving high performance. For sentiment the differences are smaller. However, two passes outperform a single pass or zero passes. In the latter case, there is no episodic memory at all and outputs are passed directly from the input module to the answer module. We note that especially complicated examples are more often correctly classified with 2 passes, but many examples in sentiment contain only simple sentiment words and no negation or misleading expressions. Hence the need for a complicated architecture is small for them. The same is true for POS tagging. Here, differences in accuracy are less than 0.1 between different numbers of passes. Next, we show that the additional correct classifications are

We also evaluate the episodic memory module for sentiment analysis. Given that the DMN performs well with both one iteration and two iterations, we study test examples where the one-iteration DMN is incorrect and the two-iteration DMN is correct. Looking at the sentences in Fig. 4 and 5, we make the following observations:

1. The attention of the two-iteration DMN is generally much more focused compared to that of the one-iteration DMN. We believe this is due to the fact that with fewer iterations over the input, the hidden states of the input module encoder have to capture more of the content of adjacent time steps. Hence, the attention mechanism cannot focus on only a few key time steps. Instead, it needs to pass all necessary information to the answer module from a single pass.

2. During the second iteration of the two-iteration DMN, the attention becomes significantly more focused on relevant key words and less attention is paid to strong sentiment words that lose their sentiment in context. This is exemplified by the sentence in Fig. 5 that includes the very positive word "best." In the first iteration, the word "best" dominates the attention scores (darker color means larger score). However, once its context, "is best described", is clear, its relevance is diminished and "lukewarm" becomes more important.

We conclude that the ability of the episodic memory module to perform multiple passes over the data is beneficial. It provides significant benefits on harder bAbI tasks, which require reasoning over several pieces of information or transitive reasoning. Increasing the number of passes also slightly improves the performance on sentiment analysis, though the difference is not as significant. We did not attempt more iterations for sentiment analysis as the model struggles with overfitting with three passes.

Question: Where was Mary before the Bedroom?
Answer: Cinema.
Facts (columns: Episode 1, Episode 2, Episode 3; attention shown as cell shading):
Yesterday Julie traveled to the school.
Yesterday Marie went to the cinema.
This morning Julie traveled to the kitchen.
Bill went back to the cinema yesterday.
Mary went to the bedroom this morning.
Julie went back to the bedroom this afternoon.
[done reading]

Table 5. An example of what the DMN focuses on during each episode on a real query in the bAbI task. Darker colors mean that the attention weight is higher.

[Figure 4: attention heat maps over two example sentences. Panels: 1-iter DMN (pred: negative, ans: positive); 2-iter DMN (pred: positive, ans: positive); 1-iter DMN (pred: very positive, ans: negative); 2-iter DMN (pred: negative, ans: negative).]

Figure 4. Attention weights for sentiment examples that were only labeled correctly by a DMN with two episodes. The y-axis shows the episode number. This sentence demonstrates a case where the ability to iterate allows the DMN to sharply focus on relevant words.

[Figure 5: attention heat maps over two example sentences. Panels: 1-iter DMN (pred: negative, ans: positive); 2-iter DMN (pred: positive, ans: positive); 1-iter DMN (pred: positive, ans: negative); 2-iter DMN (pred: negative, ans: negative).]

Figure 5. These sentences demonstrate cases where initially positive words lost their importance after the entire sentence context became clear, either through a contrastive conjunction ("but") or a modified action ("best described").

5. Conclusion

The DMN model is a potentially general architecture for a variety of NLP applications, including classification, question answering and sequence modeling. A single architecture is a first step towards a single joint model for multiple NLP problems. The DMN is trained end-to-end with one, albeit complex, objective function. Future work will explore additional tasks, larger multi-task models and multimodal inputs and questions.
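As a supplementary illustration of the multi-pass analysis in Section 4.4, the following minimal sketch shows why a second pass can retrieve a fact that the first pass cannot. It is a toy under stated assumptions, not the DMN's episodic memory module: the function name `episodic_passes`, the dot-product scoring, and the rule that replaces the memory with the latest episode are all hypothetical simplifications standing in for the learned gating and recurrent memory update. Setting `num_passes=0` skips the loop entirely, loosely mirroring the zero-pass setting in which no episodic memory is used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def episodic_passes(facts, question, num_passes):
    """Run num_passes attention passes over the fact vectors.

    Returns the final memory vector and the attention weights of each pass.
    Assumption: the memory starts as the question vector, and each pass
    replaces it with the attention-weighted sum of the facts (a stand-in
    for a learned recurrent update).
    """
    memory = list(question)
    history = []
    for _ in range(num_passes):
        # Score every fact against the current memory only (toy scoring).
        weights = softmax([dot(fact, memory) for fact in facts])
        # Episode: attention-weighted sum of the facts.
        episode = [
            sum(w * fact[i] for w, fact in zip(weights, facts))
            for i in range(len(question))
        ]
        memory = episode
        history.append(weights)
    return memory, history


if __name__ == "__main__":
    # Toy vectors: fact 0 overlaps with the question; fact 1 overlaps
    # only with fact 0, so it is reachable only through a second pass.
    question = [1.0, 0.0, 0.0]
    facts = [[2.0, 3.0, 0.0], [0.0, 5.0, 0.0]]
    _, history = episodic_passes(facts, question, num_passes=2)
    for step, weights in enumerate(history, start=1):
        print("pass", step, "attends most to fact", weights.index(max(weights)))
```

In this toy setup the first pass attends mostly to the fact that overlaps with the question, and the second pass shifts to the fact that is related only to what was retrieved first, which is the kind of transitive step a single pass cannot take.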

References

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Learning to Compose Neural Networks for Question Answering. arXiv preprint arXiv:1601.01705, 2016.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing. AISTATS, 2012.

Bowman, S. R., Potts, C., and Manning, C. D. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

Chen, X. and Zitnick, C. L. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014a.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, 2014b.

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

Dusek, J. A. and Eichenbaum, H. The hippocampus and memory for orderly stimulus relations. Proceedings of the National Academy of Sciences, 94(13):7109–7114, 1997.

Eichenbaum, H. and Cohen, N. J. From Conditioning to Conscious Recollection: Memory Systems of the Brain (Oxford Psychology). Oxford University Press, 1 edition, 2004. ISBN 0195178041.

Elman, J. L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225, 1991.

Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. CoRR, abs/1410.5401, 2014.

Heckers, S., Zalesak, M., Weiss, A. P., Ditman, T., and Titone, D. Hippocampal activation during transitive inference in humans. Hippocampus, 14:153–62, 2004.

Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. Teaching machines to read and comprehend. In NIPS, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov 1997. ISSN 0899-7667.

Howard, M. W. and Kahana, M. J. A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3):269–299, 2002.

Irsoy, O. and Cardie, C. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015.

Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., and Daumé III, H. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.

Joulin, A. and Mikolov, T. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.

Kaiser, L. and Sutskever, I. Neural GPUs Learn Algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, 2014.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, 2014.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Malinowski, M. and Fritz, M. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), June 1993.

Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. In SLT, pp. 234–239. IEEE, 2012. ISBN 978-1-4673-5125-6.

Passos, A., Kumar, V., and McCallum, A. Lexicon infused phrase embeddings for named entity resolution. In Conference on Computational Natural Language Learning. Association for Computational Linguistics, June 2014.

Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In EMNLP, 2014.

Socher, R., Gershman, S., Perotte, A., Sederberg, P., Blei, D., and Norman, K. A bayesian analysis of dynamics in free recall. In NIPS, 2009.

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, 2011.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

Søgaard, A. Semisupervised condensed nearest neighbor for part-of-speech tagging. In ACL-HLT, 2011.

Stollenga, M. F., Masci, J., Gomez, F., and Schmidhuber, J. Deep Networks with Internal Selective Attention through Feedback Connections. In NIPS, 2014.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.

Weston, J., Bordes, A., Chopra, S., and Mikolov, T. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015a.

Weston, J., Chopra, S., and Bordes, A. Memory networks. In ICLR, 2015b.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.

Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.

Yates, A., Banko, M., Broadhead, M., Cafarella, M. J., Etzioni, O., and Soderland, S. Textrunner: Open information extraction on the web. In HLT-NAACL (Demonstrations), 2007.