Andreas Neural Module Networks

Visual question answering is fundamentally compositional in nature—a question like where is the dog? shares sub-structure with questions like what color is the dog? and where is the cat? This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural “modules” into deep networks for question answering. Our approach decomposes questions into their linguistic sub-structures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.

Neural Module Networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
University of California, Berkeley
{jda,rohrbach,trevor,klein}

1. Introduction

This paper describes an approach to visual question answering based on a new model architecture that we call a neural module network (NMN). This architecture makes it possible to answer natural language questions about images using collections of jointly-trained neural “modules”, dynamically composed into deep networks based on linguistic structure.

Concretely, given an image and an associated question (e.g. where is the dog?), we wish to predict a corresponding answer (e.g. on the couch, or perhaps just couch) (Figure 1). The visual question answering task has significant applications to human-robot interaction, search, and accessibility, and has been the subject of a great deal of recent research attention [3, 10, 26, 28, 33, 40]. The task requires sophisticated understanding of both visual scenes and natural language. Recent successful approaches represent questions as bags of words, or encode the question using a recurrent neural network [28] and train a simple classifier on the encoded question and image. In contrast to these monolithic approaches, another line of work for textual QA [23] and image QA [27] uses semantic parsers to decompose questions into logical expressions. These logical expressions are evaluated against a purely logical representation of the world, which may be provided directly or extracted from an image [21].

Figure 1: A schematic representation of our proposed model. The shaded gray area is a neural module network of the kind introduced in this paper. Our approach uses a natural language parser to dynamically lay out a deep network composed of reusable modules. For visual question answering tasks, an additional sequence model provides sentence context and learns common-sense knowledge.

In this paper we draw from both lines of research, presenting a technique for integrating the representational power of neural networks with the flexible compositional structure afforded by symbolic approaches to semantics. Rather than relying on a monolithic network structure to answer all questions, our approach assembles a network on the fly from a collection of specialized, jointly-learned modules (Figure 1). Rather than using logic to reason over truth values, the representations computed by our model remain entirely in the domain of visual features and attentions.

Our approach first analyzes each question with a semantic parser, and uses this analysis to determine the basic computational units (attention, classification, etc.) needed to answer the question, as well as the relationships between these units. In Figure 1, we first produce an attention focused on the dog, which passes its output to a location describer. Depending on the underlying structure, these messages passed between modules may be raw image features, attentions, or classification decisions; each module maps from specific input to output types. Different kinds of modules are shown in different colors; attention-producing modules (like dog) are shown in green, while labeling modules (like where) are shown in blue. Importantly, all modules in an NMN are independent and composable, which allows the computation to be different for each problem instance, and possibly unobserved during training. Outside the NMN, our final answer uses a recurrent network (LSTM) to read the question, an additional step which has been shown to be important for modeling common sense knowledge and dataset biases [28].

We evaluate our approach on two visual question answering tasks. On the recently-released VQA [3] dataset we achieve results comparable to or better than existing approaches. However, many of the questions in the VQA dataset are quite simple, with little composition or reasoning required. To test our approach’s ability to handle harder questions, we introduce a new dataset of synthetic images paired with complex questions involving spatial relations, set-theoretic reasoning, and shape and attribute recognition. On this dataset we outperform baseline approaches by as much as 25% absolute accuracy.

While all the applications considered in this paper involve visual question answering, the architecture is much more general, and might easily be applied to visual referring expression resolution [9, 34] or question answering about natural language texts [15].

To summarize our contributions: We first describe neural module networks, a general architecture for discretely composing heterogeneous, jointly-trained neural modules into deep networks. Next, for the visual QA task specifically, we show how to construct NMNs based on the output of a semantic parser, and use these to successfully complete established visual question answering tasks. Finally, we introduce a new dataset of challenging, highly compositional questions about abstract shapes, and show that our model again outperforms previous approaches. We have released the dataset, as well as code for the system described in this paper.

2. Motivations

We begin with two simple observations. First, state-of-the-art performance on the full range of computer vision tasks that are studied requires a variety of different deep network topologies: there is no single “best network” for all tasks. Second, though different networks are used for different purposes, it is commonplace to initialize systems for many vision tasks with a prefix of a network trained for classification [12]. This has been shown to substantially reduce training time and improve accuracy. So while network structures are not universal (in the sense that the same network is appropriate for all problems), they are at least empirically modular (in the sense that intermediate representations for one task are useful for many others).

Can we generalize this idea in a way that is useful for question answering? Rather than thinking of question answering as a problem of learning a single function to map from questions and images to answers, it is perhaps useful to think of it as a highly-multitask learning setting, where each problem instance is associated with a novel task, and the identity of that task is expressed only noisily in language. In particular, where a simple question like is this a truck? requires us to retrieve only one piece of information from an image, more complicated questions, like how many objects are to the left of the toaster? might require multiple processing steps. The compositional nature of language means that the number of such processing steps is potentially unbounded. Moreover, multiple kinds of processing might be required: repeated convolutions might identify a truck, but some kind of recurrent architecture is likely necessary to count up to arbitrary numbers.

Thus our goal in this paper is to specify a framework for modular, composable, jointly-trained neural networks. In this framework, we first predict the structure of the computation needed to answer each question individually, then realize this structure by constructing an appropriately-shaped neural network from an inventory of reusable modules. These modules are learned jointly, rather than trained in isolation, and specialization to individual tasks (identifying properties, spatial relations, etc.) arises naturally from the training objective.

3. Related work

Visual Question Answering. Answering questions about images is sometimes referred to as a “Visual Turing Test” [27, 11]. It has only recently gained popularity, following the emergence of appropriate datasets consisting of paired images, questions, and answers. While the DAQUAR dataset [27] is restricted to indoor scenes and contains relatively few examples, the COCOQA dataset [40] and the VQA dataset [3] are significantly larger and have more visual variety. Both are based on images from the COCO dataset [24]. While COCOQA contains question-answer pairs automatically generated from the descriptions associated with the COCO dataset, [3] has crowdsourced question-answer pairs. We evaluate our approach on VQA, the larger and more natural of the two datasets.

Notable “classical” approaches to this task include [27, 21]. Both of these approaches are similar to ours in their use of a semantic parser, but rely on fixed logical inference rather than learned compositional operations.
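The framework outlined in Section 2 (predict a structure for each question, then realize it from a shared inventory of reusable modules) can be illustrated with a small sketch. This is not the authors' released code: the `ModuleInventory` class, the nested-tuple layout encoding, and the random parameter vectors are all hypothetical stand-ins, chosen only to show how two different networks can end up sharing the parameters of the same module instance.

```python
import random


class ModuleInventory:
    """Hypothetical store holding one parameter vector per module name."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.params = {}

    def get(self, mtype, instance):
        # Parameters are created on first use and reused afterwards, so every
        # layout that mentions find[dog] touches the same weights.
        key = (mtype, instance)
        if key not in self.params:
            self.params[key] = [self.rng.gauss(0.0, 0.1) for _ in range(self.dim)]
        return self.params[key]


def instantiate(layout, inventory):
    """Turn a nested layout (type, instance, *children) into a tree of
    (name, params, children) nodes that reference shared parameters."""
    mtype, instance, *children = layout
    return (
        f"{mtype}[{instance}]",
        inventory.get(mtype, instance),
        [instantiate(child, inventory) for child in children],
    )


def render(node):
    """Write a network in the TYPE[INSTANCE](ARG, ...) notation."""
    name, _, children = node
    if not children:
        return name
    return name + "(" + ", ".join(render(c) for c in children) + ")"


inventory = ModuleInventory()
where_dog = instantiate(("describe", "where", ("find", "dog")), inventory)
color_dog = instantiate(("describe", "color", ("find", "dog")), inventory)

print(render(where_dog))  # describe[where](find[dog])
print(render(color_dog))  # describe[color](find[dog])
# Both networks point at the very same find[dog] parameter list.
print(where_dog[2][0][1] is color_dog[2][0][1])  # True
```

Because the parameter lists are shared by reference, a gradient update applied while training on one question would also change the behavior of every other network that uses the same module, which is the sense in which the modules are trained jointly.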

Several neural models for visual question answering have already been proposed in the literature [33, 26, 10], all of which use standard deep sequence modeling machinery to construct a joint embedding of image and text, which is immediately mapped to a distribution over answers. Here we attempt to more explicitly model the computational process needed to produce each answer, but benefit from techniques for producing sequence and image embeddings that have been important in previous work.

One important component of visual question answering is grounding the question in the image. This grounding task has previously been approached in [18, 32, 17, 20, 14], where the authors tried to localize phrases in an image. [39] use an attention mechanism to predict a heatmap for each word during sentence generation. The attentional component of our model is inspired by these approaches.

General compositional semantics. There is a large literature on learning to answer questions about structured knowledge representations from question–answer pairs, both with and without joint learning of meanings for simple predicates [23, 21]. Outside of question answering, several models have been proposed for instruction following that impose a discrete “planning structure” over an underlying continuous control signal [1, 30]. We are unaware of past use of a semantic parser to predict network structures, or more generally to exploit the natural similarity between set-theoretic approaches to classical semantic parsing and attentional approaches to computer vision.

Neural network architectures. The idea of selecting a different network graph for each input datum is fundamental to both recurrent networks (where the network grows in the length of the input) [8] and recursive neural networks (where the network is built, e.g., according to the syntactic structure of the input) [36]. But both of these approaches ultimately involve repeated application of a single computational module (e.g. an LSTM [13] or GRU [5] cell). From another direction, some kinds of memory networks [38] may be viewed as a special case of our model with a fixed computational graph, consisting of a sequence of find modules followed by a describe module (see Section 4). Other policy- and algorithm-learning approaches with modular substructure include [16, 4]. [31] describe a procedure for learning to assemble programs from a collection of functional primitives whose behavior is fully specified.

Our basic contribution is in both assembling this graph on the fly, and simultaneously in allowing the nodes to perform heterogeneous computations, with “messages” of different kinds (raw image features, attentions, and classification predictions) passed from one module to the next. We are unaware of any previous work allowing such mixed collections of modules to be trained jointly.

4. Neural module networks for visual QA

Each training datum for this task can be thought of as a 3-tuple (w, x, y), where

• w is a natural-language question
• x is an image
• y is an answer

A model is fully specified by a collection of modules {m}, each with associated parameters θ_m, and a network layout predictor P which maps from strings to networks. Given (w, x) as above, the model instantiates a network based on P(w), passes x (and possibly w again) as inputs, and obtains a distribution over labels (for the VQA task, we require the output module produce an answer representation). Thus a model ultimately encodes a predictive distribution p(y | w, x; θ).

In the remainder of this section, we describe the set of modules used for the VQA task, then explain the process by which questions are converted to network layouts.

4.1. Modules

Our goal here is to identify a small set of modules that can be assembled into all the configurations necessary for our tasks. This corresponds to identifying a minimal set of composable vision primitives. The modules operate on three basic data types: images, unnormalized attentions, and labels. For the particular task and modules described in this paper, almost all interesting compositional phenomena occur in the space of attentions, and it is not unreasonable to characterize our contribution more narrowly as an “attention-composition” network. Nevertheless, other types may be easily added in the future (for new applications or for greater coverage in the VQA domain).

First, some notation: module names are typeset in a fixed width font, and are of the form TYPE[INSTANCE](ARG1, ...). TYPE is a high-level module type (attention, classification, etc.) of the kind described below. INSTANCE is the particular instance of the model under consideration; for example, find[red] locates red things, while find[dog] locates dogs. Weights may be shared at both the type and instance level. Modules with no arguments implicitly take the image as input; higher-level modules may also inspect the image.

Find: Image → Attention
Diagram: find[red] (Convolution)

A find module find[c] convolves every position in the input image with a weight vector (distinct for each c) to produce a heatmap or unnormalized attention. So, for example, the output of the module find[dog] is a matrix whose entries should be large in regions of the image containing dogs, and small everywhere else.

Transform: Attention → Attention
Diagram: transform[above] (FC, ReLU, ×2)

The transform module transform[c] is implemented as a multilayer perceptron with rectified nonlinearities (ReLUs), performing a fully-connected mapping from one attention to another. Again, the weights for this mapping are distinct for each c. So transform[above] should take an attention and shift the regions of greatest activation upward (as above), while transform[not] should move attention away from the active regions. For the experiments in this paper, the first fully-connected (FC) layer produces a vector of size 32, and the second is the same size as the input.

Combine: Attention × Attention → Attention
Diagram: combine[or] (Stack, Conv., ReLU)

A combination module combine[c] merges two attentions into a single attention. For example, combine[and] should be active only in the regions that are active in both inputs, while combine[or] should be active where the first input is active and the second is inactive. It is implemented as a convolution followed by a nonlinearity.

Describe: Image × Attention → Label
Diagram: describe[color] (Attend, FC), example output: red

A describe module describe[c] takes an attention and the input image and maps both to a distribution over labels. It first computes an average over image features weighted by the attention, then passes this averaged feature vector through a single fully-connected layer. For example, describe[color] should return a representation of the colors in the region attended to.

Measure: Attention → Label
Diagram: measure[be] (FC, ReLU, FC, Softmax), example output: yes

A measurement module measure[c] takes an attention alone and maps it to a distribution over labels. Because attentions passed between modules are unnormalized, measure is suitable for evaluating the existence of a detected object, or counting sets of objects.

4.2. From strings to networks

Having built up an inventory of modules, we now need to assemble them into the layout specified by the question. The transformation from a natural language question to an instantiated neural network takes place in two steps. First we map from natural language questions to layouts, which specify both the set of modules used to answer a given question, and the connections between them. Next, these layouts are used to assemble the final prediction networks.

We use standard tools pre-trained on existing linguistic resources to obtain structured representations of questions. Future work might focus on learning this prediction process jointly with the rest of the system.

Parsing. We begin by parsing each question with the Stanford Parser [19] to obtain a universal dependency representation [6]. Dependency parses express grammatical relations between parts of a sentence (e.g. between objects and their attributes, or events and their participants), and provide a lightweight abstraction away from the surface form of the sentence. The parser also performs basic lemmatization, for example turning kites into kite and were into be. This reduces sparsity of module instances.

Next, we filter the set of dependencies to those connected to the wh-word or copula in the question (the exact distance we traverse varies depending on the task, and how many is treated as a special case). This gives a simple symbolic form expressing (the primary part of) the sentence’s meaning.¹ For example, what is standing in the field becomes what(stand); what color is the truck becomes color(truck), and is there a circle next to a square becomes is(circle, next-to(square)). In the process we also strip away function words like determiners and modals, so what type of cakes were they? and what type of cake is it? both get converted to type(cake). The code for transforming parse trees to structured queries is provided in the accompanying software package.

These representations bear a certain resemblance to pieces of a combinatory logic [23]: every leaf is implicitly a function taking the image as input, and the root represents the final value of the computation. But our approach, while compositional and combinatorial, is crucially not logical:

¹ The Stanford parser achieves an F1 score of 87.2 for predicted attachments on the standard Penn Treebank benchmark [29]. While there is no gold-standard parsing data in the particular formal representation produced after our transformation is applied, the hand-inspection of parses described in Section 7 is broadly consistent with baseline parser accuracy.

the inferential computations operate on continuous representations produced by neural networks, becoming discrete only in the prediction of the final answer.

Figure 2: Sample NMNs for question answering about natural images and shapes. For both examples, layouts, attentions, and answers are real predictions made by our model.
(a) NMN for answering the question What color is his tie? The find[tie] module first identifies the location of the tie. The describe[color] module uses this heatmap to produce a weighted average of image features, which are finally used to predict an output label.
(b) NMN for answering the question Is there a red shape above a circle? The two find modules locate the red shapes and circles, the transform[above] shifts the attention above the circles, the combine module computes their intersection, and the measure[is] module inspects the final attention and determines that it is non-empty.

Layout. These symbolic representations already determine the structure of the predicted networks, but not the identities of the modules that compose them. This final assignment of modules is fully determined by the structure of the parse. All leaves become find modules, all internal nodes become transform or combine modules dependent on their arity, and root nodes become describe or measure modules depending on the domain (see Section 6).

Given the mapping from queries to network layouts described above, we have for each training example a network structure, an input image, and an output label. In many cases, these network structures are different, but have tied parameters. Networks which have the same high-level structure but different instantiations of individual modules (for example what color is the cat? / describe[color](find[cat]) and where is the truck? / describe[where](find[truck])) can be processed in the same batch, allowing efficient computation.

As noted above, parts of this conversion process are task-specific: we found that relatively simple expressions were best for the natural image questions, while the synthetic data (by design) required deeper structures. Some summary statistics are provided in Table 1.

          types                              # instances  # layouts  max depth  max size
VQA       find, combine, describe                    877      51138          3         4
SHAPES    find, transform, combine, measure            8        164          5         6

Table 1: Structure summary statistics for neural module networks used in this paper. “types” is the set of high-level module types available (e.g. find), “# instances” is the number of specific module instances (e.g. find[llama]), and “# layouts” is the number of distinct composed structures (e.g. describe[color](find[llama])). “Max depth” is the greatest depth across all layouts, while “max size” is the greatest number of modules; for example, the network in Figure 2b has depth 4 and size 5. (All numbers from training sets.)

Generalizations. It is easy to imagine applications where the input to the layout stage comes from something other than a natural language parser. Users of an image database, for example, might write SQL-like queries directly in order to specify their requirements precisely, e.g.

    COUNT(AND(orange, cat)) == 3

or even mix visual and non-visual specifications in their queries:

    IS(cat) and date_taken > 2014-11-5

Indeed, it is possible to construct this kind of “visual SQL” using precisely the approach described in this paper: once our system is trained, the learned modules for attention, classification, etc. can be assembled by any kind of outside user, without relying on natural language specifically.

4.3. Answering natural language questions

So far our discussion has focused on the neural module net architecture, without reference to the remainder of Figure 1. Our final model combines the output from the neural module network with predictions from a simple LSTM question encoder. This is important for two reasons. First, because of the relatively aggressive simplification of the question that takes place in the parser, grammatical cues that do not substantively change the semantics of the question (but which might affect the answer) are discarded. For example, what is flying? and what are flying? both get converted to what(fly), but their answers might be kite and kites respectively given the same underlying image features. The question encoder thus allows us to model underlying syntactic regularities in the data. Second, it allows us to capture semantic regularities: with missing or low-quality image data, it is reasonable to guess that what color is the bear? is answered by brown, and unreasonable to guess green. The question encoder also allows us to model effects of this kind. All experiments in this paper use a standard single-layer LSTM with 1000 hidden units.

To compute an answer, we pass the final hidden state of the LSTM through a fully connected layer, add it elementwise to the representation produced by the root module of the NMN, apply a ReLU nonlinearity, and finally another fully connected layer and softmax to obtain a distribution over answers. In keeping with previous work, we have treated answer prediction as a pure classification problem: the model selects from the set of answers observed during training (whether or not they contain multiple words), treating each answer as a distinct class. Thus no parameters are shared between, e.g., left side and left in this final prediction layer. The extension to a model in which multi-word answers are generated one word at a time by a recurrent decoder is straightforward, but we leave it for future work.

5. Training neural module networks

Our training objective is simply to find module parameters maximizing the likelihood of the data. By design, the last module in every network outputs a distribution over labels, and so each assembled network also represents a probability distribution.

Because of the dynamic network structures used to answer questions, some weights are updated much more frequently than others. For this reason we found that learning algorithms with adaptive per-weight learning rates performed substantially better than simple gradient descent. All the experiments described below use ADADELTA with standard parameter settings [41].

It is important to emphasize that the labels we have assigned to distinguish instances of the same module type (cat, color, etc.) are a notational convenience, and do not reflect any manual specification of the behavior of the corresponding modules. find[cat] is not fixed or even initialized as a cat recognizer (rather than a couch recognizer or a dog recognizer). Instead, it acquires this behavior as a byproduct of the end-to-end training procedure. As can be seen in Figure 2, the image–answer pairs and parameter tying together encourage each module to specialize in the appropriate way.

6. Experiments: compositionality

We begin with a set of motivating experiments on synthetic data. Compositionality, and the corresponding ability to answer questions with arbitrarily complex structure, is an essential part of the kind of deep image understanding visual QA datasets are intended to test. At the same time, questions in most existing natural image datasets are quite simple, for the most part requiring that only one or two pieces of information be extracted from an image in order to answer them successfully, and with little evaluation of robustness in the presence of distractors (e.g. asking is there a blue house in an image of a red house and a blue car).

As one of the primary goals of this work is to learn models for deep semantic compositionality, we have created SHAPES, a synthetic dataset that places such compositional phenomena at the forefront. This dataset consists of complex questions about simple arrangements of colored shapes (Figure 3). Questions contain between two and four attributes, object types, or relationships. The SHAPES dataset contains 244 unique questions, pairing each question with 64 different images (for a total of 15616 unique question/image pairs, with 14592 in the training set and 1024 in the test set). To eliminate mode-guessing as a viable strategy, all questions have a yes-or-no answer, but good performance requires that the system learn to recognize shapes and colors, and understand both spatial and logical relations among sets of objects.

While success on this dataset is by no means a sufficient condition for robust visual QA, we believe it is a necessary one. In this respect it is similar in spirit to the bAbI [37] dataset, and we hope that SHAPES will continue to be used in conjunction with natural image datasets.

To produce an initial set of image features, we pass the input image through the convolutional portion of a LeNet [22] which is jointly trained with the question-answering part of the model. We compare our approach to a reimplementation of the VIS+LSTM baseline similar to the one described by [33], again swapping out the pre-trained image embedding with a LeNet.

As can be seen in Table 2, our model achieves excellent performance on this dataset, while the VIS+LSTM baseline fares little better than a majority guesser. Moreover, the color detectors and attention transformations behave as expected (Figure 2b), indicating that our joint training procedure correctly allocates responsibilities among modules. This confirms that our approach is able to model complex compositional phenomena outside the capacity of previous approaches to visual question answering.

                        size 4   size 5   size 6   All
% of test set           31       56       13
Majority                64.4     62.5     61.7     63.0
VIS+LSTM                71.9     62.5     61.7     65.3
NMN                     89.7     92.4     85.2     90.6
NMN (train size ≤ 5)    97.7     91.1     89.7     90.8

Table 2: Results on the SHAPES dataset. Here “size” is the number of modules needed to instantiate an appropriate NMN. Our model achieves high accuracy and outperforms a baseline from previous work, especially on highly compositional questions. “NMN (train size ≤ 5)” is trained on a modified training set with no size-6 questions; these results demonstrate that our model is able to generalize to questions more complicated than it has seen at training time.

We perform an additional experiment on a modified version of the training set, which contains no size-6 questions (i.e. questions whose corresponding NMN has 6 modules). Performance in this case is indistinguishable from the full training set; this demonstrates that our model is able to generalize to questions more complicated than those it has seen during training. Using linguistic information, the model extrapolates simple visual patterns to deeper structures.

7. Experiments: natural images

Next we consider the model’s ability to handle hard perceptual problems involving natural images. Here we evaluate on the VQA dataset [3]. This is the largest resource of its kind, consisting of more than 200,000 images from MSCOCO [25], each paired with three questions and ten answers per question generated by human annotators. We train our model using the standard train/test split, training only with those answers marked as high confidence. The visual input to the NMN is the conv5 layer of a 16-layer VGGNet [35] after max-pooling, with features normalized to have mean 0 and standard deviation 1. In addition to results with the VGG pretrained on ImageNet, we also report results with the VGG fine-tuned (+FT) on MSCOCO for the captioning task [7]. We find that performance is best on this task if the top-level module is always describe, even when the question involves quantification.

Results are shown in Table 3. We compare to a number of baselines, including a text-only baseline (LSTM), a previous baseline approach that predicts answers directly from an encoding of the image and the question [3], and an attentional baseline (ATT+LSTM). This last baseline shares the basic computational structure of our model without syntactic compositionality: it uses the same network layout for every question (a find module followed by a describe module), with parameters tied across all problem instances. As can be seen in Table 1, the number of module types and instances is quite large. Rare words (occurring fewer than 10 times in the training data) are mapped to a single token or module instance in the LSTM encoder and module network.

                  test-dev                             test
                  Yes/No   Number   Other   All        All
LSTM              78.7     36.6     28.1    49.8       –
VIS+LSTM [3] ²    78.9     35.2     36.4    53.7       54.1
ATT+LSTM          80.6     36.4     42.0    57.2       –
NMN               70.7     36.8     39.2    54.8       –
NMN+LSTM          81.2     35.2     43.3    58.0       –
NMN+LSTM+FT       81.2     38.0     44.0    58.6       58.7

Table 3: Results on the VQA test server. LSTM is a question-only baseline, VIS+LSTM is a previous baseline that combines a question representation with a representation of the full image, and ATT+LSTM is a model with the same attentional structure as our approach but no lexical information. NMN+LSTM is the full model shown in Figure 1, while NMN is an ablation experiment with no whole-question LSTM. NMN+LSTM+FT is the same model, with image features fine-tuned on MSCOCO captions. This model outperforms previous approaches, scoring particularly well on questions not involving a binary decision.

Our model outperforms all the listed baselines on this task. A breakdown of our questions by answer type reveals that our model performs especially well on questions answered by an object, attribute, or number. Investigation of parser outputs also suggests that there is substantial room to improve the system using a better parser. A hand inspection of the first 50 parses in the training set suggests that most (80–90%) of questions asking for simple properties of objects are correctly analyzed, but more complicated questions are more prone to picking up irrelevant predicates. For example, are these people most likely experiencing a work day? is parsed as be(people, likely), when the desired analysis is be(people, work). Parser errors of this kind could be fixed with joint learning.

Figure 3 is broadly suggestive of the kinds of prediction errors made by the system, including plausible semantic confusions (cardboard interpreted as leather, round windows interpreted as clocks), normal lexical variation (container for cup), and use of answers that are a priori plausible but unrelated to the image (describing a horse as located in a pen rather than a barn).

8. Conclusions and future work

In this paper, we have introduced neural module networks, which provide a general-purpose framework for learning collections of neural modules which can be dynamically assembled into arbitrary deep networks. We have demonstrated that this approach achieves state-of-the-art performance on existing datasets for visual question an-

² After the current work was accepted for publication, an improved version of this baseline was published, featuring a deeper sentence representation and multiplicative interactions between the sentence and scene representations. This improved baseline gives an overall score of 57.8. We expect that many of these modifications could be applied to our own system to obtain similar gains.

8. how many different lights what is the color of the what color is the vase? is the bus full of passen- is there a red shape above in various different shapes horse? gers? a circle? and sizes? describe[count]( describe[color]( describe[color]( describe[is]( measure[is]( find[light]) find[horse]) find[vase]) combine[and]( combine[and]( find[bus], find[red], find[full]) transform[above]( find[circle]))) four (four) brown (brown) green (green) yes (yes) yes (yes) what is stuffed with where does the tabby cat what material are the is this a clock? is a red shape blue? toothbrushes wrapped in watch a horse eating hay? boxes made of? plastic? describe[what]( describe[where]( describe[material]( describe[is]( measure[is]( find[stuff]) find[watch]) find[box]) find[clock]) combine[and]( find[red], find[blue])) container (cup) pen (barn) leather (cardboard) yes (no) yes (no) Figure 3: Example output from our approach on different visual QA tasks. The top row shows correct answers, while the bottom row shows mistakes (the most common answer from human annotators is given in parentheses). swering, performing especially well on questions answered first steps toward joint learning of module behavior and a by an object or an attribute. Additionally, we have in- parser in a follow-up to this work [2]. troduced a new dataset of highly compositional questions The fact that our neural module networks can be about simple arrangements of shapes, and shown that our trained to produce predictable outputs—even when freely approach substantially outperforms previous work. composed—points toward a more general paradigm of So far we have maintained a strict separation between “programs” built from neural networks. In this paradigm, predicting network structures and learning network param- network designers (human or automated) have access to a eters. 
It is easy to imagine that these two problems might standard kit of neural parts from which to construct mod- be solved jointly, with uncertainty maintained over network els for performing complex reasoning tasks. While visual structures throughout training and decoding. This might be question answering provides a natural testbed for this ap- accomplished either with a monolithic network, by using proach, its usefulness is potentially much broader, extend- some higher-level mechanism to “attend” to relevant por- ing to queries about documents and structured knowledge tions of the computation, or else by integrating with exist- bases or more general function approximation and signal ing tools for learning semantic parsers [21]. We describe processing. 46
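The layouts reported with Figure 3 are nested expressions of the form type[argument](children), and Table 2 groups questions by the number of modules such a layout contains. As a minimal sketch of how such a string could be read into a tree and its size counted, the following Python may help. It is an illustration only, not the authors' code: the names Module, parse_layout, and count_modules are hypothetical, and the grammar is inferred from the printed examples.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Module:
    """One node of a layout tree (hypothetical type, for illustration only)."""
    kind: str                                   # module type, e.g. "find"
    arg: str                                    # lexical argument, e.g. "horse"
    children: List["Module"] = field(default_factory=list)


# Matches one "type[argument]" token, tolerating surrounding whitespace.
_TOKEN = re.compile(r"\s*([A-Za-z_]+)\[([^\]]+)\]\s*")


def _parse(text: str, pos: int) -> Tuple[Module, int]:
    """Parse one module starting at pos; return it and the next position."""
    match = _TOKEN.match(text, pos)
    if match is None:
        raise ValueError(f"expected a module at position {pos}")
    node = Module(match.group(1), match.group(2))
    pos = match.end()
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = _parse(text, pos)
            node.children.append(child)
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos < len(text) and text[pos] == ",":
                pos += 1                        # another argument follows
            elif pos < len(text) and text[pos] == ")":
                pos += 1                        # argument list is closed
                break
            else:
                raise ValueError(f"expected ',' or ')' at position {pos}")
    return node, pos


def parse_layout(text: str) -> Module:
    """Parse a full layout string such as 'describe[color](find[horse])'."""
    node, pos = _parse(text, 0)
    if text[pos:].strip():
        raise ValueError(f"unexpected trailing text: {text[pos:]!r}")
    return node


def count_modules(node: Module) -> int:
    """Number of modules in the layout (the 'size' used in Table 2)."""
    return 1 + sum(count_modules(child) for child in node.children)


if __name__ == "__main__":
    layout = parse_layout(
        "measure[is](combine[and](find[red], transform[above](find[circle])))"
    )
    print(count_modules(layout))
```

Under this reading, the layout for "is there a red shape above a circle?" contains five modules, which would place it in the size-5 group of Table 2.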

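Section 7 states that rare words (occurring fewer than 10 times in the training data) are mapped to a single token. One way this step might look is sketched below. The token name "<unk>", the function names build_vocab and encode, and the whitespace-tokenized input are assumptions made for illustration; only the threshold of 10 comes from the text.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK = "<unk>"      # assumed name for the shared rare-word token
MIN_COUNT = 10     # threshold stated in Section 7


def build_vocab(questions: Iterable[List[str]],
                min_count: int = MIN_COUNT) -> Dict[str, int]:
    """Assign an index to every word seen at least min_count times."""
    counts = Counter(word for question in questions for word in question)
    vocab = {UNK: 0}
    for word, n in sorted(counts.items()):
        if n >= min_count:          # words seen fewer times stay out
            vocab[word] = len(vocab)
    return vocab


def encode(question: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each word by its index, sending rare words to UNK."""
    return [vocab.get(word, vocab[UNK]) for word in question]


if __name__ == "__main__":
    training = [["what", "color"]] * 10 + [["what", "zebra"]]
    vocab = build_vocab(training)
    print(encode(["what", "zebra"], vocab))
```

The same lookup could in principle be applied to module arguments, so that all rare words share one module instance, as the text describes.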
Acknowledgments

The authors are grateful to Lisa Anne Hendricks, Eric Tzeng, and Russell Stewart for useful conversations, and to Nvidia for a hardware grant. JA is supported by a National Science Foundation Graduate Research Fellowship. MR is supported by a fellowship within the FIT weltweit-Program of the German Academic Exchange Service (DAAD). TD was supported in part by DARPA; AFRL; DoD MURI award N000141110688; NSF awards IIS-1212798, IIS-1427425, and IIS-1536003, and the Berkeley Vision and Learning Center.

References

[1] J. Andreas and D. Klein. Grounding language with points and paths in continuous spaces. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL), 2014.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2016.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[4] A. Braylan, M. Hollenbeck, E. Meyerson, and R. Miikkulainen. Reuse of neural modules for general video game playing. arXiv preprint arXiv:1512.01537, 2015.
[5] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST), 2014.
[6] M.-C. De Marneffe and C. D. Manning. The Stanford typed dependencies representation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 1–8. Association for Computational Linguistics, 2008.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[9] N. FitzGerald, Y. Artzi, and L. Zettlemoyer. Learning distributions over logical forms for referring expression generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
[10] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems (NIPS), 2015.
[11] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[16] Ł. Kaiser and I. Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
[17] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (NIPS), 2014.
[19] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 423–430. Association for Computational Linguistics, 2003.
[20] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[21] J. Krishnamurthy and T. Kollar. Jointly learning to parse and perceive: connecting natural language to the physical world. Transactions of the Association for Computational Linguistics (TACL), 2013.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[23] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. Computational Linguistics (CL), 39(2):389–446, 2013.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[26] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2016.
[27] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NIPS), 2014.
[28] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[29] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[30] C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox. A joint model of language and perception for grounded attribute learning. In Proceedings of the International Conference on Machine Learning (ICML), 2012.
[31] A. Neelakantan, Q. V. Le, and I. Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
[32] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[33] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems (NIPS), 2015.
[34] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. arXiv preprint arXiv:1511.03745, 2015.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[36] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. Parsing with compositional vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2013.
[37] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
[38] J. Weston, S. Chopra, and A. Bordes. Memory networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[39] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[40] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual madlibs: Fill in the blank image generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[41] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.