Rich feature hierarchies for accurate object detection and semantic segmentation

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at˜rbg/rcnn

Tech report (v5)
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley
{rbg,jdonahue,trevor,malik}
arXiv:1311.2524v5 [cs.CV] 22 Oct 2014

[Figure 1 diagram, titled "R-CNN: Regions with CNN features": 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features (on each warped region); 4. Classify regions (aeroplane? no. person? yes. tvmonitor? no.)]

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.

1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

Fukushima’s “neocognitron” [19], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x, 0) rectifying non-linearities and “dropout” regularization).

The significance of the ImageNet result was vigorously
debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat [34] detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset. OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%.

A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [17, 20]. We also point readers to contemporaneous work by Donahue et al. [12], who show that Krizhevsky’s CNN can be used (without fine-tuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [39]).

Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [23]. As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.

Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.

2.1. Module design

Region proposals. A variety of recent papers offer methods for generating category-independent region proposals.
Examples include: objectness [1], selective search [39], category-independent object proposals [14], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [39, 41]).

Figure 2: Warped training samples from VOC 2007 train. [Panels: aeroplane, bicycle, bird, car.]

Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. Alternatives to warping are discussed in Appendix A.

2.2. Test-time detection

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). We warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [39], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).

The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes.

This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.

It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).

2.3. Training

Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data). Pre-training was performed using the open source Caffe CNN library [24]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [25], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set. This discrepancy is due to simplifications in the training process.

Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged. For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with
≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.

Object category classifiers. Consider training a binary classifier to detect cars. It’s clear that an image region tightly enclosing a car should be a positive example. Similarly, it’s clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [39], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.

In Appendix B we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.

2.4. Results on PASCAL VOC 2010-12

Following the PASCAL VOC best practices [15], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding-box regression).

Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [18], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [39], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.

2.5. Results on ILSVRC2013 detection

We ran R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

Figure 3 compares R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [34]. R-CNN achieves a mAP of 31.4%, which is significantly ahead of the second-best result of 24.3% from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented and a table of per-class APs follows at the end of the paper in Table 8. Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes.

In Section 4, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when running R-CNN on it.

3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

First-layer filters can be visualized directly and are easy to understand [25]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [42]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit’s activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit “speak for itself” by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.
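
The ranking procedure of Section 3.1 (sort proposals by a unit's activation, apply non-maximum suppression, keep the top-scoring regions) can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the (image_id, box) representation, the choice to suppress only among proposals from the same image, and the IoU threshold of 0.3 are all assumptions made for the example. The same greedy loop also illustrates the per-class non-maximum suppression used at test time in Section 2.2, where the threshold is learned.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners, x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union overlap of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def top_regions_for_unit(
    proposals: Sequence[Tuple[int, Box]],  # hypothetical (image_id, box) pairs
    activations: Sequence[float],          # one activation of the chosen unit per proposal
    iou_threshold: float = 0.3,            # assumed value, not taken from the paper
    top_k: int = 16,
) -> List[int]:
    """Return indices of the top_k proposals by activation after greedy NMS."""
    # Visit proposals from highest to lowest activation.
    order = sorted(range(len(proposals)), key=lambda i: activations[i], reverse=True)
    kept: List[int] = []
    for i in order:
        image_id, box = proposals[i]
        # A proposal is rejected if it overlaps an already kept, higher-scoring
        # proposal from the same image by more than the threshold.
        suppressed = any(
            proposals[j][0] == image_id and iou(box, proposals[j][1]) > iou_threshold
            for j in kept
        )
        if not suppressed:
            kept.append(i)
            if len(kept) == top_k:
                break
    return kept
```

Because proposals are visited in descending order of activation, the loop can stop as soon as top_k regions have been kept, which keeps the cost low even for a very large pool of proposals.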

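The fine-tuning labels and mini-batch composition described in Section 2.3 (proposals with at least 0.5 IoU overlap with a ground-truth box are positives for that box's class, and each SGD mini-batch holds 32 positive and 96 background windows) can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, class 0 is used for background, a proposal that overlaps several ground-truth boxes takes the class of the best-overlapping one, and sampling falls back to drawing with replacement when an image pool has too few windows. None of these details are specified in the text above.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners
BACKGROUND = 0  # assumed label for background; object classes are 1..N


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union overlap of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def finetune_label(
    proposal: Box,
    gt_boxes: Sequence[Box],
    gt_classes: Sequence[int],
    pos_iou: float = 0.5,
) -> int:
    """Class of the best-overlapping ground-truth box if IoU >= pos_iou, else background."""
    best_iou, best_class = 0.0, BACKGROUND
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_class = overlap, cls
    return best_class if best_iou >= pos_iou else BACKGROUND


def sample_minibatch(
    labels: Sequence[int],
    rng: random.Random,
    n_pos: int = 32,
    n_bg: int = 96,
) -> List[int]:
    """Return n_pos positive and n_bg background proposal indices (128 by default)."""
    pos = [i for i, y in enumerate(labels) if y != BACKGROUND]
    bg = [i for i, y in enumerate(labels) if y == BACKGROUND]

    def draw(pool: List[int], n: int) -> List[int]:
        # Uniform sampling without replacement when possible; falling back to
        # sampling with replacement for small pools is an assumption.
        # Both pools are assumed to be non-empty.
        return rng.sample(pool, n) if len(pool) >= n else rng.choices(pool, k=n)

    return draw(pos, n_pos) + draw(bg, n_bg)
```

Fixing the batch at one quarter positives is what biases sampling towards positive windows, which would otherwise be swamped by background proposals.
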
VOC 2010 test      aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
DPM v5 [20]†       49.2   53.8   13.1   15.3   35.5   53.4   49.7   27.0   17.2   28.8   14.7   17.8   46.4   51.2   47.7   10.8   34.2   20.7   43.8   38.3   33.4
UVA [39]           56.2   42.4   15.3   12.6   21.8   49.3   36.8   46.1   12.9   32.1   30.0   36.5   43.5   52.9   32.9   15.3   41.1   31.8   47.0   44.8   35.1
Regionlets [41]    65.0   48.9   25.9   24.6   24.5   56.1   54.5   51.2   17.0   28.9   30.2   35.8   40.2   55.7   43.5   14.3   43.9   32.6   54.0   45.9   39.7
SegDPM [18]†       61.4   53.4   25.6   25.2   35.5   51.7   50.6   50.8   19.3   33.8   26.8   40.4   48.3   54.4   47.1   14.8   38.7   35.0   52.8   43.1   40.4
R-CNN              67.1   64.1   46.7   32.0   30.5   56.4   57.2   65.9   27.0   47.3   40.9   66.6   57.8   65.9   53.6   26.7   56.5   38.1   52.8   50.2   50.2
R-CNN BB           71.8   65.8   53.0   36.8   35.9   59.7   60.0   69.9   27.9   50.6   41.4   70.0   62.0   69.0   58.1   29.5   59.4   39.3   61.2   52.4   53.7

Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding-box regression (BB) is described in Section C. At publication time, SegDPM was the top-performer on the PASCAL VOC leaderboard. † DPM and SegDPM use context rescoring not used by the other methods.

[Figure 3, left panel: "ILSVRC2013 detection test set mAP"; horizontal axis: mean average precision (mAP) in %; the legend distinguishes competition results from post-competition results.]
  *R-CNN BB       31.4%
  *OverFeat (2)   24.3%
  UvA-Euvision    22.6%
  *NEC-MU         20.9%
  *OverFeat (1)   19.4%
  Toronto A       11.5%
  SYSU_Vision     10.5%
  GPU_UCLA         9.8%
  Delta            6.1%
  UIUC-IFP         1.0%
[Figure 3, right panel: "ILSVRC2013 detection test set class AP box plots"; vertical axis: average precision (AP) in %; one box per method.]

Figure 3: (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceded by * use outside training data (images and labels from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-competition OverFeat result is not shown because per-class APs are not yet available (per-class APs for R-CNN are in Table 8 and also included in the tech report source uploaded to; see R-CNN-ILSVRC2013-APs.txt). The red line marks the median AP, the box bottom and top are the 25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best viewed digitally with zoom).

Figure 4: Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people (row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).
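
As a quick check on Table 1, mAP is the unweighted mean of the 20 per-class APs, and the quantities drawn in the Figure 3 box plots (median, 25th and 75th percentiles, min and max) can be computed from the same kind of per-class list. The sketch below uses the two R-CNN rows of Table 1 as data. The function names are hypothetical, and the percentile interpolation rule ("inclusive") is an assumption, since the text does not say which rule the original plots use.

```python
import statistics
from typing import Dict, Sequence

# Per-class APs (%) copied from the R-CNN rows of Table 1 (VOC 2010 test),
# in the table's class order: aero, bike, bird, ..., train, tv.
RCNN_AP = [67.1, 64.1, 46.7, 32.0, 30.5, 56.4, 57.2, 65.9, 27.0, 47.3,
           40.9, 66.6, 57.8, 65.9, 53.6, 26.7, 56.5, 38.1, 52.8, 50.2]
RCNN_BB_AP = [71.8, 65.8, 53.0, 36.8, 35.9, 59.7, 60.0, 69.9, 27.9, 50.6,
              41.4, 70.0, 62.0, 69.0, 58.1, 29.5, 59.4, 39.3, 61.2, 52.4]


def mean_ap(per_class_ap: Sequence[float]) -> float:
    """Mean average precision: the unweighted mean of the per-class APs."""
    return sum(per_class_ap) / len(per_class_ap)


def box_plot_summary(per_class_ap: Sequence[float]) -> Dict[str, float]:
    """Min, quartiles, median and max of the per-class APs."""
    # The "inclusive" interpolation rule is an assumption made for this sketch.
    q1, median, q3 = statistics.quantiles(per_class_ap, n=4, method="inclusive")
    return {
        "min": min(per_class_ap),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(per_class_ap),
    }


if __name__ == "__main__":
    print(round(mean_ap(RCNN_AP), 1))     # 50.2, the mAP column for R-CNN
    print(round(mean_ap(RCNN_BB_AP), 1))  # 53.7, the mAP column for R-CNN BB
    print(box_plot_summary(RCNN_BB_AP))
```

Both means reproduce the mAP column of Table 1, which confirms that the reported mAP weights every class equally regardless of how many test instances it has.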

VOC 2007 test      aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
R-CNN pool5        51.8   60.2   36.4   27.8   23.2   52.8   60.6   49.2   18.3   47.8   44.3   40.8   56.6   58.7   42.4   23.4   46.1   36.7   51.3   55.7   44.2
R-CNN fc6          59.3   61.8   43.1   34.0   25.1   53.1   60.6   52.8   21.7   47.8   42.7   47.8   52.5   58.5   44.6   25.6   48.3   34.0   53.1   58.0   46.2
R-CNN fc7          57.6   57.9   38.5   31.8   23.7   51.2   58.9   51.4   20.0   50.5   40.9   46.0   51.6   55.9   43.3   23.3   48.1   35.3   51.0   57.4   44.7
R-CNN FT pool5     58.2   63.3   37.9   27.6   26.1   54.1   66.9   51.4   26.7   55.5   43.4   43.1   57.7   59.0   45.8   28.1   50.8   40.6   53.1   56.4   47.3
R-CNN FT fc6       63.5   66.0   47.9   37.7   29.9   62.5   70.2   60.2   32.0   57.9   47.0   53.5   60.1   64.2   52.2   31.3   55.0   50.0   57.7   63.0   53.1
R-CNN FT fc7       64.2   69.7   50.0   41.9   32.0   62.6   71.0   60.7   32.7   58.5   46.5   56.1   60.6   66.8   54.2   31.5   52.8   48.9   57.9   64.7   54.2
R-CNN FT fc7 BB    68.1   72.8   56.8   43.0   36.8   66.3   74.2   67.6   34.4   63.5   54.5   61.2   69.1   68.6   58.7   33.4   62.9   51.1   62.5   64.8   58.5
DPM v5 [20]        33.2   60.3   10.2   16.1   27.3   54.3   58.2   23.0   20.0   24.1   26.7   12.7   58.1   48.2   43.2   12.0   21.1   36.1   46.0   43.5   33.7
DPM ST [28]        23.8   58.2   10.5    8.5   27.1   50.4   52.0    7.3   19.2   22.8   18.1    8.0   55.9   44.8   32.4   13.3   15.9   22.8   46.2   44.9   29.1
DPM HSC [31]       32.2   58.3   11.5   16.3   30.6   49.9   54.8   23.5   21.5   27.7   34.0   13.7   58.1   51.6   39.9   12.4   23.5   34.4   47.4   45.2   34.3

Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding-box regression (BB) stage that reduces localization errors (Section C). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.

VOC 2007 test      aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
R-CNN T-Net        64.2   69.7   50.0   41.9   32.0   62.6   71.0   60.7   32.7   58.5   46.5   56.1   60.6   66.8   54.2   31.5   52.8   48.9   57.9   64.7   54.2
R-CNN T-Net BB     68.1   72.8   56.8   43.0   36.8   66.3   74.2   67.6   34.4   63.5   54.5   61.2   69.1   68.6   58.7   33.4   62.9   51.1   62.5   64.8   58.5
R-CNN O-Net        71.6   73.5   58.1   42.2   39.4   70.7   76.0   74.5   38.7   71.0   56.9   74.5   67.9   69.6   59.3   35.7   62.1   64.0   66.5   71.2   62.2
R-CNN O-Net BB     73.4   77.0   63.4   45.4   44.6   75.1   78.1   79.8   40.5   73.7   62.2   79.4   78.1   73.1   64.2   35.6   66.8   67.2   70.4   71.1   66.0

Table 3: Detection average precision (%) on VOC 2007 test for two different CNN architectures. The first two rows are results from Table 2 using Krizhevsky et al.’s architecture (T-Net). Rows three and four use the recently proposed 16-layer architecture from Simonyan and Zisserman (O-Net) [43].

We visualize units from layer pool5, which is the max-pooled output of the network’s fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.

Each row in Figure 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (Appendix D includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

3.2. Ablation studies

Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN’s last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096 × 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e., all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.

Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [20].

The first DPM feature learning method, DPM ST [28], augments HOG features with histograms of “sketch token” probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC [31], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%, a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines; both use non-public implementations of DPM that underperform the open source version [20]). These methods achieve mAPs of 29.1% and 34.3%, respectively.

3.3. Network architectures

Most results in this paper use the network architecture from Krizhevsky et al. [25]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance. In Table 3 we show results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman [43]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge. The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.

To use O-Net in R-CNN, we downloaded the publicly available pre-trained network weights for the VGG_ILSVRC_16_layers model from the Caffe Model Zoo. We then fine-tuned the network using the same protocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory. The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%. However, there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.

3.4. Detection error analysis

We applied the excellent detection analysis tool from Hoiem et al. [23] in order to reveal our method’s error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [23] to understand some finer details (such as “normalized AP”). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 5 and Figure 6.

3.5. Bounding-box regression

Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [17], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Appendix C. Results in Table 1, Table 2, and Figure 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.

3.6. Qualitative results

Qualitative detection results on ILSVRC2013 are presented in Figure 8 and Figure 9 at the end of the paper. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action. More qualitative results are presented in Figure 10 and Figure 11, but these have been curated. We selected each image because it contained interesting, surprising, or amusing results. Here, also, all detections at precision greater than 0.5 are shown.

4. The ILSVRC2013 detection dataset

In Section 2 we presented results on the ILSVRC2013 detection dataset. This dataset is less homogeneous than

PASCAL VOC, requiring choices about how to use it. Since these decisions are non-trivial, we cover them in this section.

[Figure 6 plots: four panels titled “R-CNN fc6: sensitivity and impact”, “R-CNN FT fc7: sensitivity and impact”, “R-CNN FT fc7 BB: sensitivity and impact”, and “DPM voc-release5: sensitivity and impact”. Vertical axis: normalized AP. Horizontal axis: occ, trn, size, asp, view, part.]

Figure 6: Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [23]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). We show plots for our method (R-CNN) with and without fine-tuning (FT) and bounding-box regression (BB) as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding-box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.

[Figure 5 plots: two rows of three panels, “R-CNN fc6”, “R-CNN FT fc7”, and “R-CNN FT fc7 BB”, for animals (top row) and furniture (bottom row). Vertical axis: percentage of each type (Loc, Sim, Oth, BG). Horizontal axis: total false positives (25 to 6400).]

Figure 5: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc: poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim: confusion with a similar category; Oth: confusion with a dissimilar object category; BG: a FP that fired on background. Compared with DPM (see [23]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding-box regression method fixes many localization errors.

4.1. Dataset overview

The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses. The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images. The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. The train set, in contrast, is drawn from the ILSVRC2013 classification image distribution. These images have more variable complexity with a skew towards images of a single centered object. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. In any given train image, instances from the 200 classes may or may not be labeled. In addition to these image sets, each class has an extra set of negative images. Negative images are manually checked to validate that they do not contain any instances of their associated class. The negative image sets were not used in this work. More information on how ILSVRC was collected and annotated can be found in [11, 36].

The nature of these splits presents a number of choices for training R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test. Should the train images be used at all, and if so, to what extent? While we have not thoroughly evaluated a large number of choices, we present what seemed like the most obvious path based on previous experience.

Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, we split it into roughly equally sized “val1” and “val2” sets. Since some classes have very few examples in val (the smallest has only 31 and half have fewer than 110), it is important to produce an approximately class-balanced partition. To do this, a large number of candidate splits were generated and the one with the smallest maximum relative
class imbalance was selected (relative imbalance is measured as |a − b|/(a + b), where a and b are the class counts in each half of the split). Each candidate split was generated by clustering val images using their class counts as features, followed by a randomized local search that may improve the split balance. The particular split used here has a maximum relative imbalance of about 11% and a median relative imbalance of 4%. The val1/val2 split and code used to produce them will be publicly available to allow other researchers to compare their methods on the val splits used in this report.

4.2. Region proposals

We followed the same region proposal approach that was used for detection on PASCAL. Selective search [39] was run in “fast mode” on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search. On val, selective search resulted in an average of 2403 region proposals per image with a 91.6% recall of all ground-truth bounding boxes (at 0.5 IoU threshold). This recall is notably lower than in PASCAL, where it is approximately 98%, indicating significant room for improvement in the region proposal stage.

4.3. Training data

For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We’ll call this dataset of images and boxes val1+trainN. In an ablation study, we show mAP on val2 for N ∈ {0, 500, 1000} (Section 4.5).

Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training. CNN fine-tuning was run for 50k SGD iterations on val1+trainN using the exact same settings as were used for PASCAL. Fine-tuning on a single NVIDIA Tesla K20 took 13 hours using Caffe. For SVM training, all ground-truth boxes from val1+trainN were used as positive examples for their respective classes. Hard negative mining was performed on a randomly selected subset of 5000 images from val1. An initial experiment indicated that mining negatives from a 5000-image subset (roughly half of val1), rather than from all of val1, resulted in only a 0.5 percentage point drop in mAP, while cutting SVM training time in half. No negative examples were taken from train because the annotations are not exhaustive. The extra sets of verified negative images were not used. The bounding-box regressors were trained on val1.

4.4. Validation and evaluation

Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above. All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL. Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC; however, the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning. After selecting the best choices on val2, we submitted exactly two result files to the ILSVRC2013 evaluation server. The first submission was without bounding-box regression and the second submission was with bounding-box regression. For these submissions, we expanded the SVM and bounding-box regressor training sets to use val+train1k and val, respectively. We used the CNN that was fine-tuned on val1+train1k to avoid re-running fine-tuning and feature computation.

4.5. Ablation study

Table 4 shows an ablation study of the effects of different amounts of training data, fine-tuning, and bounding-box regression. A first observation is that mAP on val2 matches mAP on test very closely. This gives us confidence that mAP on val2 is a good indicator of test set performance. The first result, 20.9%, is what R-CNN achieves using a CNN pre-trained on the ILSVRC2012 classification dataset (no fine-tuning) and given access to the small amount of training data in val1 (recall that half of the classes in val1 have between 15 and 55 examples). Expanding the training set to val1+trainN improves performance to 24.1%, with essentially no difference between N = 500 and N = 1000. Fine-tuning the CNN using examples from just val1 gives a modest improvement to 26.5%; however, there is likely significant overfitting due to the small number of positive training examples. Expanding the fine-tuning set to val1+train1k, which adds up to 1000 positive examples per class from the train set, helps significantly, boosting mAP to 29.7%. Bounding-box regression improves results to 31.0%, which is a smaller relative gain than what was observed in PASCAL.

4.6. Relationship to OverFeat

There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of R-CNN. If one were to replace selective search region
proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9x faster, based on a figure of 2 seconds per image quoted from [34]. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. Speeding up R-CNN should be possible in a variety of ways and remains as future work.

| test set | val2 | val2 | val2 | val2 | val2 | val2 | test | test |
|---|---|---|---|---|---|---|---|---|
| SVM training set | val1 | val1+train.5k | val1+train1k | val1+train1k | val1+train1k | val1+train1k | val+train1k | val+train1k |
| CNN fine-tuning set | n/a | n/a | n/a | val1 | val1+train1k | val1+train1k | val1+train1k | val1+train1k |
| bbox reg set | n/a | n/a | n/a | n/a | n/a | val1 | n/a | val |
| CNN feature layer | fc6 | fc6 | fc6 | fc7 | fc7 | fc7 | fc7 | fc7 |
| mAP | 20.9 | 24.1 | 24.1 | 26.5 | 29.7 | 31.0 | 30.2 | 31.4 |
| median AP | 17.7 | 21.0 | 21.4 | 24.8 | 29.2 | 29.6 | 29.0 | 30.3 |

Table 4: ILSVRC2013 ablation study of data usage choices, fine-tuning, and bounding-box regression.

5. Semantic segmentation

Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNN to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for “second-order pooling”) [4], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR). The high performance of their approach is due to the quality of the CPMC regions and the powerful second-order pooling of multiple feature types (enriched variants of SIFT and LBP). We also note that Farabet et al. [16] recently demonstrated good results on several dense scene labeling datasets (not including PASCAL) using a CNN as a multi-scale per-pixel classifier.

We follow [2, 4] and extend the PASCAL segmentation training set to include the extra annotations made available by Hariharan et al. [22]. Design decisions and hyperparameters were cross-validated on the VOC 2011 validation set. Final test results were evaluated only once.

CNN features for segmentation. We evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227. The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window, exactly as we did for detection. However, these features ignore the non-rectangular shape of the region. Two regions might have very similar bounding boxes while having very little overlap. Therefore, the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. The third strategy (full+fg) simply concatenates the full and fg features; our experiments validate their complementarity.

| O2P [4] | full R-CNN fc6 | full R-CNN fc7 | fg R-CNN fc6 | fg R-CNN fc7 | full+fg R-CNN fc6 | full+fg R-CNN fc7 |
|---|---|---|---|---|---|---|
| 46.4 | 43.0 | 42.5 | 43.7 | 42.1 | 47.9 | 45.8 |

Table 5: Segmentation mean accuracy (%) on VOC 2011 validation. Column 1 presents O2P; 2-7 use our CNN pre-trained on ILSVRC 2012.

Results on VOC 2011. Table 5 shows a summary of our results on the VOC 2011 validation set compared with O2P. (See Appendix E for complete per-category results.) Within each feature computation strategy, layer fc6 always outperforms fc7 and the following discussion refers to the fc6 features. The fg strategy slightly outperforms full, indicating that the masked region shape provides a stronger signal, matching our intuition. However, full+fg achieves an average accuracy of 47.9%, our best result by a margin of 4.2 percentage points (also modestly outperforming O2P), indicating that the context provided by the full features is highly informative even given the fg features. Notably, training the 20 SVRs on our full+fg features takes an hour on a single core, compared to 10+ hours for training on O2P features.

In Table 6 we present results on the VOC 2011 test set, comparing our best-performing method, fc6 (full+fg), against two strong baselines. Our method achieves the highest segmentation accuracy for 11 out of 21 categories, and the highest overall segmentation accuracy of 47.9%, averaged across categories (but likely ties with the O2P result under any reasonable margin of error). Still better performance could likely be achieved by fine-tuning.
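The input preparation for the fg strategy, and the concatenation used by full+fg, can be illustrated with a minimal Python sketch. It is not taken from the released system: it assumes a single-channel warped window stored as nested lists, a binary foreground mask of the same shape, and a scalar mean, and the function names `fg_input`, `subtract_mean`, and `full_plus_fg` are hypothetical.

```python
def fg_input(window, mask, mean):
    """Keep foreground pixels; replace background pixels with the mean value.

    window: 2-D list of pixel values (assumed single channel for brevity).
    mask:   2-D list of 0/1 flags, 1 marking the region's foreground.
    mean:   scalar mean input value (an assumption; a real system may use
            a per-pixel or per-channel mean).
    """
    return [
        [px if m else mean for px, m in zip(row, mask_row)]
        for row, mask_row in zip(window, mask)
    ]


def subtract_mean(window, mean):
    """Mean subtraction applied before the window is fed to the network."""
    return [[px - mean for px in row] for row in window]


def full_plus_fg(full_features, fg_features):
    """full+fg: concatenate the two feature vectors into one."""
    return list(full_features) + list(fg_features)
```

After `fg_input` followed by `subtract_mean`, every background position is exactly zero, which is the property the fg strategy relies on. The feature vectors passed to `full_plus_fg` would come from the CNN, which this sketch does not model.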

| VOC 2011 test | bg | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R&P [2] | 83.4 | 46.8 | 18.9 | 36.6 | 31.2 | 42.7 | 57.3 | 47.4 | 44.1 | 8.1 | 39.4 | 36.1 | 36.3 | 49.5 | 48.3 | 50.7 | 26.3 | 47.2 | 22.1 | 42.0 | 43.2 | 40.8 |
| O2P [4] | 85.4 | 69.7 | 22.3 | 45.2 | 44.4 | 46.9 | 66.7 | 57.8 | 56.2 | 13.5 | 46.1 | 32.3 | 41.2 | 59.1 | 55.3 | 51.0 | 36.2 | 50.4 | 27.8 | 46.9 | 44.6 | 47.6 |
| ours (full+fg R-CNN fc6) | 84.2 | 66.9 | 23.7 | 58.3 | 37.4 | 55.4 | 73.3 | 58.7 | 56.5 | 9.7 | 45.5 | 29.5 | 49.3 | 40.1 | 57.8 | 53.9 | 33.8 | 60.7 | 22.7 | 47.1 | 41.3 | 47.9 |

Table 6: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the “Regions and Parts” (R&P) method of [2] and the second-order pooling (O2P) method of [4]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P.

6. Conclusion

In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.

We achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce. We show that it is highly effective to pre-train the network, with supervision, for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems.

We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional neural networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.

Acknowledgments. This research was supported in part by DARPA Mind’s Eye and MSEE programs, by NSF awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation.

Appendix

A. Object proposal transformations

The convolutional neural network used in this work requires a fixed-size input of 227 × 227 pixels. For detection, we consider object proposals that are arbitrary image rectangles. We evaluated two approaches for transforming object proposals into valid CNN inputs.

The first method (“tightest square with context”) encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size. Figure 7 column (B) shows this transformation. A variant on this method (“tightest square without context”) excludes the image content that surrounds the original object proposal. Figure 7 column (C) shows this transformation. The second method (“warp”) anisotropically scales each object proposal to the CNN input size. Figure 7 column (D) shows the warp transformation.

[Figure 7 images: two example proposals, each shown in columns (A) to (D).]

Figure 7: Different object proposal transformations. (A) the original object proposal at its actual scale relative to the transformed CNN inputs; (B) tightest square with context; (C) tightest square without context; (D) warp. Within each column and example proposal, the top row corresponds to p = 0 pixels of context padding while the bottom row has p = 16 pixels of context padding.

For each of these transformations, we also consider including additional image context around the original object proposal. The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame. Figure 7 shows p = 0 pixels in the top row of each example and p = 16 pixels in the bottom row. In all methods, if the source rectangle extends beyond the image, the missing data is replaced with the image mean (which is then subtracted before inputting the image into the CNN). A pilot set of experiments showed that warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points). Obviously more alternatives are possible, including using replication instead of mean padding. Exhaustive evaluation of these alternatives is left as future work.
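One way to read the definition of context padding is sketched below in Python. The sketch computes, for the “warp” and “tightest square with context” transformations, the source rectangle in image coordinates that would be rescaled to the CNN input so that the original proposal occupies the inner (227 − 2p) pixels. It rests on assumptions: boxes are given as continuous `(x1, y1, x2, y2)` corner coordinates, and the helper names `warp_source_rect` and `tightest_square_rect` are hypothetical rather than part of the released code.

```python
def warp_source_rect(box, input_size=227, pad=16):
    """Source rectangle for the anisotropic 'warp' with context padding.

    The proposal is stretched independently in x and y so that it fills
    the inner (input_size - 2 * pad) pixels; `pad` input-frame pixels of
    surrounding image context are kept on every side.
    """
    x1, y1, x2, y2 = box
    inner = input_size - 2 * pad
    scale_x = inner / (x2 - x1)  # image pixels -> input pixels, x axis
    scale_y = inner / (y2 - y1)  # image pixels -> input pixels, y axis
    pad_x = pad / scale_x        # padding expressed in image pixels
    pad_y = pad / scale_y
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


def tightest_square_rect(box, input_size=227, pad=16):
    """Source rectangle for 'tightest square with context' plus padding.

    The proposal is enclosed in the smallest centered square, which is
    then scaled isotropically, so one scale factor serves both axes.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    side = max(x2 - x1, y2 - y1)
    scale = (input_size - 2 * pad) / side
    half = side / 2 + pad / scale
    return (cx - half, cy - half, cx + half, cy + half)
```

Either rectangle may extend beyond the image (for example, a negative top coordinate); under the scheme described above, those missing pixels would be filled with the image mean before mean subtraction.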

B. Positive vs. negative examples and softmax

Two design choices warrant further discussion. The first is: Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs? To review the definitions briefly, for fine-tuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

Historically speaking, we arrived at these definitions because we started by training SVMs on features computed by the ImageNet pre-trained CNN, and so fine-tuning was not a consideration at that point in time. In that setup, we found that our particular label definition for training SVMs was optimal within the set of options we evaluated (which included the setting we now use for fine-tuning). When we started using fine-tuning, we initially used the same positive and negative example definition as we were using for SVM training. However, we found that results were much worse than those obtained using our current definition of positives and negatives.

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

This result shows that it’s possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.

C. Bounding-box regression

We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor. This is similar in spirit to the bounding-box regression used in deformable part models [17]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.

The input to our training algorithm is a set of $N$ training pairs $\{(P^i, G^i)\}_{i=1,\ldots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$’s bounding box together with $P^i$’s width and height in pixels. Henceforth, we drop the superscript $i$ unless it is needed. Each ground-truth bounding box $G$ is specified in the same way: $G = (G_x, G_y, G_w, G_h)$. Our goal is to learn a transformation that maps a proposed box $P$ to a ground-truth box $G$.

We parameterize the transformation in terms of four functions $d_x(P)$, $d_y(P)$, $d_w(P)$, and $d_h(P)$. The first two specify a scale-invariant translation of the center of $P$’s bounding box, while the second two specify log-space translations of the width and height of $P$’s bounding box. After learning these functions, we can transform an input proposal $P$ into a predicted ground-truth box $\hat{G}$ by applying the transformation

\begin{align}
\hat{G}_x &= P_w d_x(P) + P_x \tag{1}\\
\hat{G}_y &= P_h d_y(P) + P_y \tag{2}\\
\hat{G}_w &= P_w \exp(d_w(P)) \tag{3}\\
\hat{G}_h &= P_h \exp(d_h(P)). \tag{4}
\end{align}

Each function $d_\star(P)$ (where $\star$ is one of $x, y, h, w$) is modeled as a linear function of the pool5 features of proposal $P$, denoted by $\phi_5(P)$. (The dependence of $\phi_5(P)$ on the image data is implicitly assumed.) Thus we have $d_\star(P) = \mathbf{w}_\star^{T}\phi_5(P)$, where $\mathbf{w}_\star$ is a vector of learnable model parameters. We learn $\mathbf{w}_\star$ by optimizing the regularized least squares objective (ridge regression):

$$\mathbf{w}_\star = \operatorname*{argmin}_{\hat{\mathbf{w}}_\star} \sum_{i}^{N} \left(t^i_\star - \hat{\mathbf{w}}_\star^{T}\phi_5(P^i)\right)^2 + \lambda \left\lVert \hat{\mathbf{w}}_\star \right\rVert^2. \tag{5}$$
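The box parameterization in equations (1)-(4) can be made concrete with a short Python sketch. `apply_transform` evaluates the four equations for given values of $d_x, d_y, d_w, d_h$, and `regression_targets` inverts them to obtain the values that would map $P$ exactly onto $G$. Both function names are hypothetical, boxes are assumed to be `(center_x, center_y, width, height)` tuples, and the learned linear functions of pool5 features are replaced here by plain numbers.

```python
import math


def apply_transform(P, d):
    """Equations (1)-(4): map proposal P to a predicted box.

    P: (Px, Py, Pw, Ph), center coordinates plus width and height.
    d: (dx, dy, dw, dh), outputs of the four regression functions.
    """
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return (
        Pw * dx + Px,       # (1) scale-invariant shift of the center x
        Ph * dy + Py,       # (2) scale-invariant shift of the center y
        Pw * math.exp(dw),  # (3) log-space change of the width
        Ph * math.exp(dh),  # (4) log-space change of the height
    )


def regression_targets(P, G):
    """Invert (1)-(4): the d values for which apply_transform(P, d) == G."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return (
        (Gx - Px) / Pw,
        (Gy - Py) / Ph,
        math.log(Gw / Pw),
        math.log(Gh / Ph),
    )
```

For instance, with `P = (50, 60, 20, 40)` and `G = (60, 50, 40, 40)`, the center offsets are 0.5 and −0.25 in units of the proposal's width and height, which is why the translation is described as scale-invariant.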

The regression targets $t_\star$ for the training pair $(P, G)$ are defined as

\begin{align}
t_x &= (G_x - P_x)/P_w \tag{6}\\
t_y &= (G_y - P_y)/P_h \tag{7}\\
t_w &= \log(G_w/P_w) \tag{8}\\
t_h &= \log(G_h/P_h). \tag{9}
\end{align}

As a standard regularized least squares problem, this can be solved efficiently in closed form.

We found two subtle issues while implementing bounding-box regression. The first is that regularization is important: we set $\lambda = 1000$ based on a validation set. The second issue is that care must be taken when selecting which training pairs $(P, G)$ to use. Intuitively, if $P$ is far from all ground-truth boxes, then the task of transforming $P$ to a ground-truth box $G$ does not make sense. Using examples like $P$ would lead to a hopeless learning problem. Therefore, we only learn from a proposal $P$ if it is nearby at least one ground-truth box. We implement “nearness” by assigning $P$ to the ground-truth box $G$ with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set). All unassigned proposals are discarded. We do this once for each object class in order to learn a set of class-specific bounding-box regressors.

At test time, we score each proposal and predict its new detection window only once. In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results.

D. Additional feature visualizations

Figure 12 shows additional visualizations for 20 pool5 units. For each unit, we show the 24 region proposals that maximally activate that unit out of the full set of approximately 10 million regions in all of VOC 2007 test.

We label each unit by its (y, x, channel) position in the 6 × 6 × 256 dimensional pool5 feature map. Within each channel, the CNN computes exactly the same function of the input region, with the (y, x) position changing only the receptive field.

E. Per-category segmentation results

In Table 7 we show the per-category segmentation accuracy on VOC 2011 val for each of our six segmentation methods in addition to the O2P method [4]. These results show which methods are strongest across each of the 20 PASCAL classes, plus the background class.

F. Analysis of cross-dataset redundancy

One concern when training on an auxiliary dataset is that there might be redundancy between it and the test set. Even though the tasks of object detection and whole-image classification are substantially different, making such cross-set redundancy much less worrisome, we still conducted a thorough investigation that quantifies the extent to which PASCAL test images are contained within the ILSVRC 2012 training and validation sets. Our findings may be useful to researchers who are interested in using ILSVRC 2012 as training data for the PASCAL image classification task.

We performed two checks for duplicate (and near-duplicate) images. The first test is based on exact matches of flickr image IDs, which are included in the VOC 2007 test annotations (these IDs are intentionally kept secret for subsequent PASCAL test sets). All PASCAL images, and about half of ILSVRC, were collected from flickr. This check turned up 31 matches out of 4952 (0.63%).

The second check uses GIST [30] descriptor matching, which was shown in [13] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections. Following [13], we computed GIST descriptors on warped 32 × 32 pixel versions of all ILSVRC 2012 trainval and PASCAL 2007 test images.

Euclidean distance nearest-neighbor matching of GIST descriptors revealed 38 near-duplicate images (including all 31 found by flickr ID matching). The matches tend to vary slightly in JPEG compression level and resolution, and to a lesser extent cropping. These findings show that the overlap is small, less than 1%. For VOC 2012, because flickr IDs are not available, we used the GIST matching method only. Based on GIST matches, 1.5% of VOC 2012 test images are in ILSVRC 2012 trainval. The slightly higher rate for VOC 2012 is likely due to the fact that the two datasets were collected closer together in time than VOC 2007 and ILSVRC 2012 were.

G. Document changelog

This document tracks the progress of R-CNN. To help readers understand how it has changed over time, here’s a brief changelog describing the revisions.

v1 Initial version.

v2 CVPR 2014 camera-ready revision. Includes substantial improvements in detection performance brought about by (1) starting fine-tuning from a higher learning rate (0.001 instead of 0.0001), (2) using context padding when preparing CNN inputs, and (3) bounding-box regression to fix localization errors.

v3 Results on the ILSVRC2013 detection dataset and comparison with OverFeat were integrated into several sections (primarily Section 2 and Section 4).

v4 The softmax vs. SVM results in Appendix B contained an error, which has been fixed. We thank Sergio Guadarrama for helping to identify this issue.

v5 Added results using the new 16-layer network architecture from Simonyan and Zisserman [43] to Section 3.3 and Table 3.

VOC 2011 val           bg   aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   mean
O2P [4]              84.0   69.0   21.7   47.7   42.2   42.4   64.7   65.8   57.4   12.9   37.4   20.5   43.7   35.7   52.7   51.0   35.8   51.0   28.4   59.8   49.7   46.4
full R-CNN fc6       81.3   56.2   23.9   42.9   40.7   38.8   59.2   56.5   53.2   11.4   34.6   16.7   48.1   37.0   51.4   46.0   31.5   44.0   24.3   53.7   51.1   43.0
full R-CNN fc7       81.0   52.8   25.1   43.8   40.5   42.7   55.4   57.7   51.3    8.7   32.5   11.5   48.1   37.0   50.5   46.4   30.2   42.1   21.2   57.7   56.0   42.5
fg R-CNN fc6         81.4   54.1   21.1   40.6   38.7   53.6   59.9   57.2   52.5    9.1   36.5   23.6   46.4   38.1   53.2   51.3   32.2   38.7   29.0   53.0   47.5   43.7
fg R-CNN fc7         80.9   50.1   20.0   40.2   34.1   40.9   59.7   59.8   52.7    7.3   32.1   14.3   48.8   42.9   54.0   48.6   28.9   42.6   24.9   52.2   48.8   42.1
full+fg R-CNN fc6    83.1   60.4   23.2   48.4   47.3   52.6   61.6   60.6   59.1   10.8   45.8   20.9   57.7   43.3   57.4   52.9   34.7   48.7   28.1   60.0   48.6   47.9
full+fg R-CNN fc7    82.3   56.7   20.6   49.9   44.2   43.6   59.3   61.3   57.8    7.7   38.4   15.1   53.4   43.7   50.8   52.0   34.1   47.8   24.7   60.1   55.2   45.7

Table 7: Per-category segmentation accuracy (%) on the VOC 2011 validation set.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. TPAMI, 2012.
[2] P. Arbeláez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In CVPR, 2012.
[3] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[5] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 2012.
[6] D. Cireşan, A. Giusti, L. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, 2013.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[9] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012). challenges/LSVRC/2012/.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. C. Berg, and L. Fei-Fei. Scalable multi-label annotation. In CHI, 2014.
[12] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In ICML, 2014.
[13] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of gist descriptors for web-scale image search. In Proc. of the ACM International Conference on Image and Video Retrieval, 2009.
[14] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
[16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. TPAMI, 2013.
[17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[18] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013.
[19] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
[20] R. Girshick, P. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://~rbg/latent-v5/.
[21] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.
[22] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[23] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.
[24] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe., 2013.
[25] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[26] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.
[28] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In CVPR, 2013.

[29] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[30] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[31] X. Ren and D. Ramanan. Histograms of sparse codes for object detection. In CVPR, 2013.
[32] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. TPAMI, 1998.
[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing, 1:318–362, 1986.
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014.
[35] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
[36] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations for visual object detection. In AAAI Technical Report, 4th Human Computation Workshop, 2012.
[37] K. Sung and T. Poggio. Example-based learning for view-based human face detection. Technical Report A.I. Memo No. 1521, Massachusetts Institute of Technology, 1994.
[38] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
[39] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[40] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc on Vision, Image, and Signal Processing, 1994.
[41] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
[42] M. Zeiler, G. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In CVPR, 2011.
[43] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint, arXiv:1409.1556, 2014.

class AP | class AP | class AP | class AP | class AP
accordion 50.8 | centipede 30.4 | hair spray 13.8 | pencil box 11.4 | snowplow 69.2
airplane 50.0 | chain saw 14.1 | hamburger 34.2 | pencil sharpener 9.0 | soap dispenser 16.8
ant 31.8 | chair 19.5 | hammer 9.9 | perfume 32.8 | soccer ball 43.7
antelope 53.8 | chime 24.6 | hamster 46.0 | person 41.7 | sofa 16.3
apple 30.9 | cocktail shaker 46.2 | harmonica 12.6 | piano 20.5 | spatula 6.8
armadillo 54.0 | coffee maker 21.5 | harp 50.4 | pineapple 22.6 | squirrel 31.3
artichoke 45.0 | computer keyboard 39.6 | hat with a wide brim 40.5 | ping-pong ball 21.0 | starfish 45.1
axe 11.8 | computer mouse 21.2 | head cabbage 17.4 | pitcher 19.2 | stethoscope 18.3
baby bed 42.0 | corkscrew 24.2 | helmet 33.4 | pizza 43.7 | stove 8.1
backpack 2.8 | cream 29.9 | hippopotamus 38.0 | plastic bag 6.4 | strainer 9.9
bagel 37.5 | croquet ball 30.0 | horizontal bar 7.0 | plate rack 15.2 | strawberry 26.8
balance beam 32.6 | crutch 23.7 | horse 41.7 | pomegranate 32.0 | stretcher 13.2
banana 21.9 | cucumber 22.8 | hotdog 28.7 | popsicle 21.2 | sunglasses 18.8
band aid 17.4 | cup or mug 34.0 | iPod 59.2 | porcupine 37.2 | swimming trunks 9.1
banjo 55.3 | diaper 10.1 | isopod 19.5 | power drill 7.9 | swine 45.3
baseball 41.8 | digital clock 18.5 | jellyfish 23.7 | pretzel 24.8 | syringe 5.7
basketball 65.3 | dishwasher 19.9 | koala bear 44.3 | printer 21.3 | table 21.7
bathing cap 37.2 | dog 76.8 | ladle 3.0 | puck 14.1 | tape player 21.4
beaker 11.3 | domestic cat 44.1 | ladybug 58.4 | punching bag 29.4 | tennis ball 59.1
bear 62.7 | dragonfly 27.8 | lamp 9.1 | purse 8.0 | tick 42.6
bee 52.9 | drum 19.9 | laptop 35.4 | rabbit 71.0 | tie 24.6
bell pepper 38.8 | dumbbell 14.1 | lemon 33.3 | racket 16.2 | tiger 61.8
bench 12.7 | electric fan 35.0 | lion 51.3 | ray 41.1 | toaster 29.2
bicycle 41.1 | elephant 56.4 | lipstick 23.1 | red panda 61.1 | traffic light 24.7
binder 6.2 | face powder 22.1 | lizard 38.9 | refrigerator 14.0 | train 60.8
bird 70.9 | fig 44.5 | lobster 32.4 | remote control 41.6 | trombone 13.8
bookshelf 19.3 | filing cabinet 20.6 | maillot 31.0 | rubber eraser 2.5 | trumpet 14.4
bow tie 38.8 | flower pot 20.2 | maraca 30.1 | rugby ball 34.5 | turtle 59.1
bow 9.0 | flute 4.9 | microphone 4.0 | ruler 11.5 | tv or monitor 41.7
bowl 26.7 | fox 59.3 | microwave 40.1 | salt or pepper shaker 24.6 | unicycle 27.2
brassiere 31.2 | french horn 24.2 | milk can 33.3 | saxophone 40.8 | vacuum 19.5
burrito 25.7 | frog 64.1 | miniskirt 14.9 | scorpion 57.3 | violin 13.7
bus 57.5 | frying pan 21.5 | monkey 49.6 | screwdriver 10.6 | volleyball 59.7
butterfly 88.5 | giant panda 42.5 | motorcycle 42.2 | seal 20.9 | waffle iron 24.0
camel 37.6 | goldfish 28.6 | mushroom 31.8 | sheep 48.9 | washer 39.8
can opener 28.9 | golf ball 51.3 | nail 4.5 | ski 9.0 | water bottle 8.1
car 44.5 | golfcart 47.9 | neck brace 31.6 | skunk 57.9 | watercraft 40.9
cart 48.0 | guacamole 32.3 | oboe 27.5 | snail 36.2 | whale 48.6
cattle 32.3 | guitar 33.1 | orange 38.8 | snake 33.8 | wine bottle 31.2
cello 28.9 | hair dryer 13.0 | otter 22.2 | snowmobile 58.8 | zebra 49.6

Table 8: Per-class average precision (%) on the ILSVRC2013 detection test set.

Figure 8: Example detections on the val2 set from the configuration that achieved 31.0% mAP on val2. Each image was sampled randomly (these are not curated). All detections at precision greater than 0.5 are shown. Each detection is labeled with the predicted class and the precision value of that detection from the detector’s precision-recall curve. Viewing digitally with zoom is recommended.

Figure 9: More randomly selected examples. See Figure 8 caption for details. Viewing digitally with zoom is recommended.

Figure 10: Curated examples. Each image was selected because we found it impressive, surprising, interesting, or amusing. Viewing digitally with zoom is recommended.

Figure 11: More curated examples. See Figure 10 caption for details. Viewing digitally with zoom is recommended.

[Figure 12 panels: pool5 features (3,3,1) through (3,3,20), each showing its top 1 − 24 activating regions.]

Figure 12: We show the 24 region proposals, out of the approximately 10 million regions in VOC 2007 test, that most strongly activate each of 20 units. Each montage is labeled by the unit’s (y, x, channel) position in the 6 × 6 × 256 dimensional pool5 feature map. Each image region is drawn with an overlay of the unit’s receptive field in white. The activation value (which we normalize by dividing by the max activation value over all units in a channel) is shown in the receptive field’s upper-left corner. Best viewed digitally with zoom.
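As a supplementary illustration of the bounding-box regression procedure in Appendix C, the following minimal Python sketch computes the regression targets of Eqs. (6)-(9) and selects training pairs by maximum IoU with the 0.6 threshold. It is a sketch under stated assumptions, not the released implementation: boxes are assumed to be given as (center x, center y, width, height), as the symbols P_x, P_y, P_w, P_h suggest, and the names regression_targets, apply_targets, iou, and assign_ground_truth are hypothetical helpers introduced only for this example.

```python
import math
from typing import Optional, Sequence, Tuple

# Assumed box format: (center x, center y, width, height).
Box = Tuple[float, float, float, float]


def regression_targets(p: Box, g: Box) -> Box:
    """Targets (t_x, t_y, t_w, t_h) of Eqs. (6)-(9) for proposal P and ground truth G."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_targets(p: Box, t: Box) -> Box:
    """Invert Eqs. (6)-(9): the box that targets t describe relative to proposal P."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes in center format."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def assign_ground_truth(
    p: Box, ground_truths: Sequence[Box], threshold: float = 0.6
) -> Optional[Box]:
    """Return the ground-truth box with maximum IoU, or None (proposal discarded)
    unless that IoU is strictly greater than the threshold."""
    if not ground_truths:
        return None
    best = max(ground_truths, key=lambda g: iou(p, g))
    return best if iou(p, best) > threshold else None


# Toy values chosen only for illustration.
proposal = (50.0, 50.0, 40.0, 40.0)
ground_truths = [(52.0, 48.0, 44.0, 40.0), (200.0, 200.0, 30.0, 30.0)]
matched = assign_ground_truth(proposal, ground_truths)
if matched is not None:
    targets = regression_targets(proposal, matched)
    print("targets:", targets)
    print("recovered box:", apply_targets(proposal, targets))
```

Normalizing the center offsets by the proposal's width and height, and taking logs of the size ratios, makes the targets independent of the proposal's absolute scale. The regularized least-squares fit that maps pool5 features to these targets is not shown here.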
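The second duplicate check in Appendix F (Euclidean nearest-neighbor matching of descriptors) can be sketched in the same spirit. This is a minimal sketch, assuming the GIST descriptors have already been computed elsewhere and are available as fixed-length tuples of floats. The function names, the toy descriptor values, and in particular the max_distance cutoff are assumptions for illustration; the text above does not state the distance threshold that was used.

```python
import math
from typing import Dict, List, Tuple

Descriptor = Tuple[float, ...]


def nearest_neighbor(
    query: Descriptor, gallery: Dict[str, Descriptor]
) -> Tuple[str, float]:
    """Return the id of the gallery descriptor closest to the query in Euclidean
    distance, together with that distance. Assumes a non-empty gallery."""
    best_id = min(gallery, key=lambda key: math.dist(query, gallery[key]))
    return best_id, math.dist(query, gallery[best_id])


def find_near_duplicates(
    test_set: Dict[str, Descriptor],
    train_set: Dict[str, Descriptor],
    max_distance: float,
) -> List[Tuple[str, str, float]]:
    """List (test id, train id, distance) for every test descriptor whose nearest
    training descriptor lies within max_distance (a hypothetical cutoff)."""
    matches = []
    for test_id, descriptor in test_set.items():
        train_id, distance = nearest_neighbor(descriptor, train_set)
        if distance <= max_distance:
            matches.append((test_id, train_id, distance))
    return matches


# Toy three-dimensional descriptors, made up for illustration only.
train_descriptors = {"train_a": (0.0, 0.0, 0.0), "train_b": (1.0, 1.0, 1.0)}
test_descriptors = {"test_1": (0.0, 0.1, 0.0), "test_2": (5.0, 5.0, 5.0)}

duplicates = find_near_duplicates(test_descriptors, train_descriptors, max_distance=0.5)
print(duplicates)
print("fraction flagged:", len(duplicates) / len(test_descriptors))
```

The brute-force scan is linear in the size of the training set for each query, which is fine for a toy example; at the scale of more than a million images an indexed nearest-neighbor search would likely be needed.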