Faster R-CNN

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
展开查看详情

1. 1 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to arXiv:1506.01497v3 [cs.CV] 6 Jan 2016 generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available. Index Terms—Object Detection, Region Proposal, Convolutional Neural Network. ✦ 1 I NTRODUCTION One may note that fast region-based CNNs take Recent advances in object detection are driven by advantage of GPUs, while the region proposal meth- the success of region proposal methods (e.g., [4]) ods used in research are implemented on the CPU, and region-based convolutional neural networks (R- making such runtime comparisons inequitable. An ob- CNNs) [5]. Although region-based CNNs were com- vious way to accelerate proposal computation is to re- putationally expensive as originally developed in [5], implement it for the GPU. This may be an effective en- their cost has been drastically reduced thanks to shar- gineering solution, but re-implementation ignores the ing convolutions across proposals [1], [2]. The latest down-stream detection network and therefore misses incarnation, Fast R-CNN [2], achieves near real-time important opportunities for sharing computation. rates using very deep networks [3], when ignoring the In this paper, we show that an algorithmic change— time spent on region proposals. Now, proposals are the computing proposals with a deep convolutional neu- test-time computational bottleneck in state-of-the-art ral network—leads to an elegant and effective solution detection systems. where proposal computation is nearly cost-free given the detection network’s computation. To this end, we Region proposal methods typically rely on inex- introduce novel Region Proposal Networks (RPNs) that pensive features and economical inference schemes. share convolutional layers with state-of-the-art object Selective Search [4], one of the most popular meth- detection networks [1], [2]. By sharing convolutions at ods, greedily merges superpixels based on engineered test-time, the marginal cost for computing proposals low-level features. Yet when compared to efficient is small (e.g., 10ms per image). detection networks [2], Selective Search is an order of Our observation is that the convolutional feature magnitude slower, at 2 seconds per image in a CPU maps used by region-based detectors, like Fast R- implementation. EdgeBoxes [6] currently provides the CNN, can also be used for generating region pro- best tradeoff between proposal quality and speed, posals. On top of these convolutional features, we at 0.2 seconds per image. Nevertheless, the region construct an RPN by adding a few additional con- proposal step still consumes as much running time volutional layers that simultaneously regress region as the detection network. bounds and objectness scores at each location on a • S. Ren is with University of Science and Technology of China, Hefei, regular grid. The RPN is thus a kind of fully convo- China. This work was done when S. Ren was an intern at Microsoft lutional network (FCN) [7] and can be trained end-to- Research. Email: sqren@mail.ustc.edu.cn end specifically for the task for generating detection • K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com proposals. • R. Girshick is with Facebook AI Research. The majority of this work RPNs are designed to efficiently predict region pro- was done when R. Girshick was with Microsoft Research. E-mail: posals with a wide range of scales and aspect ratios. In rbg@fb.com contrast to prevalent methods [8], [9], [1], [2] that use

2. 2 multiple filter sizes multiple references feature map feature map feature map multiple scaled images image image image (a) (b) (c) Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions. pyramids of images (Figure 1, a) or pyramids of filters mercial systems such as at Pinterests [17], with user (Figure 1, b), we introduce novel “anchor” boxes engagement improvements reported. that serve as references at multiple scales and aspect In ILSVRC and COCO 2015 competitions, Faster ratios. Our scheme can be thought of as a pyramid R-CNN and RPN are the basis of several 1st-place of regression references (Figure 1, c), which avoids entries [18] in the tracks of ImageNet detection, Ima- enumerating images or filters of multiple scales or geNet localization, COCO detection, and COCO seg- aspect ratios. This model performs well when trained mentation. RPNs completely learn to propose regions and tested using single-scale images and thus benefits from data, and thus can easily benefit from deeper running speed. and more expressive features (such as the 101-layer To unify RPNs with Fast R-CNN [2] object detec- residual nets adopted in [18]). Faster R-CNN and RPN tion networks, we propose a training scheme that are also used by several other leading entries in these alternates between fine-tuning for the region proposal competitions2 . These results suggest that our method task and then fine-tuning for object detection, while is not only a cost-efficient solution for practical usage, keeping the proposals fixed. This scheme converges but also an effective way of improving object detec- quickly and produces a unified network with convo- tion accuracy. lutional features that are shared between both tasks.1 We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs 2 R ELATED W ORK with Fast R-CNNs produce detection accuracy bet- Object Proposals. There is a large literature on object ter than the strong baseline of Selective Search with proposal methods. Comprehensive surveys and com- Fast R-CNNs. Meanwhile, our method waives nearly parisons of object proposal methods can be found in all computational burdens of Selective Search at [19], [20], [21]. Widely used object proposal methods test-time—the effective running time for proposals include those based on grouping super-pixels (e.g., is just 10 milliseconds. Using the expensive very Selective Search [4], CPMC [22], MCG [23]) and those deep models of [3], our detection method still has based on sliding windows (e.g., objectness in windows a frame rate of 5fps (including all steps) on a GPU, [24], EdgeBoxes [6]). Object proposal methods were and thus is a practical object detection system in adopted as external modules independent of the de- terms of both speed and accuracy. We also report tectors (e.g., Selective Search [4] object detectors, R- results on the MS COCO dataset [12] and investi- CNN [5], and Fast R-CNN [2]). gate the improvements on PASCAL VOC using the Deep Networks for Object Detection. The R-CNN COCO data. Code has been made publicly available method [5] trains CNNs end-to-end to classify the at https://github.com/shaoqingren/faster_ proposal regions into object categories or background. rcnn (in MATLAB) and https://github.com/ R-CNN mainly plays as a classifier, and it does not rbgirshick/py-faster-rcnn (in Python). predict object bounds (except for refining by bounding A preliminary version of this manuscript was pub- box regression). Its accuracy depends on the perfor- lished previously [10]. Since then, the frameworks of mance of the region proposal module (see compar- RPN and Faster R-CNN have been adopted and gen- isons in [20]). Several papers have proposed ways of eralized to other methods, such as 3D object detection using deep networks for predicting object bounding [13], part-based detection [14], instance segmentation boxes [25], [9], [26], [27]. In the OverFeat method [9], [15], and image captioning [16]. Our fast and effective a fully-connected layer is trained to predict the box object detection system has also been built in com- coordinates for the localization task that assumes a 1. Since the publication of the conference version of this paper single object. The fully-connected layer is then turned [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks leading to less training time. 2. http://image-net.org/challenges/LSVRC/2015/results

3. 3 classifier single, unified network for object detection (Figure 2). Using the recently popular terminology of neural RoI pooling networks with ‘attention’ [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties proposals of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared. Region Proposal Network 3.1 Region Proposal Networks feature maps A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.3 We model this process with a fully convolutional network [7], which we describe in this section. Because our ulti- conv layers mate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our ex- image periments, we investigate the Zeiler and Fergus model Figure 2: Faster R-CNN is a single, unified network [32] (ZF), which has 5 shareable convolutional layers for object detection. The RPN module serves as the and the Simonyan and Zisserman model [3] (VGG-16), ‘attention’ of this unified network. which has 13 shareable convolutional layers. To generate region proposals, we slide a small network over the convolutional feature map output into a convolutional layer for detecting multiple class- by the last shared convolutional layer. This small specific objects. The MultiBox methods [26], [27] gen- network takes as input an n × n spatial window of erate region proposals from a network whose last the input convolutional feature map. Each sliding fully-connected layer simultaneously predicts mul- window is mapped to a lower-dimensional feature tiple class-agnostic boxes, generalizing the “single- (256-d for ZF and 512-d for VGG, with ReLU [33] box” fashion of OverFeat. These class-agnostic boxes following). This feature is fed into two sibling fully- are used as proposals for R-CNN [5]. The MultiBox connected layers—a box-regression layer (reg) and a proposal network is applied on a single image crop or box-classification layer (cls). We use n = 3 in this multiple large image crops (e.g., 224×224), in contrast paper, noting that the effective receptive field on the to our fully convolutional scheme. MultiBox does not input image is large (171 and 228 pixels for ZF and share features between the proposal and detection VGG, respectively). This mini-network is illustrated networks. We discuss OverFeat and MultiBox in more at a single position in Figure 3 (left). Note that be- depth later in context with our method. Concurrent cause the mini-network operates in a sliding-window with our work, the DeepMask method [28] is devel- fashion, the fully-connected layers are shared across oped for learning segmentation proposals. all spatial locations. This architecture is naturally im- Shared computation of convolutions [9], [1], [29], plemented with an n × n convolutional layer followed [7], [2] has been attracting increasing attention for ef- by two sibling 1 × 1 convolutional layers (for reg and ficient, yet accurate, visual recognition. The OverFeat cls, respectively). paper [9] computes convolutional features from an 3.1.1 Anchors image pyramid for classification, localization, and de- At each sliding-window location, we simultaneously tection. Adaptively-sized pooling (SPP) [1] on shared predict multiple region proposals, where the number convolutional feature maps is developed for efficient of maximum possible proposals for each location is region-based object detection [1], [30] and semantic denoted as k. So the reg layer has 4k outputs encoding segmentation [29]. Fast R-CNN [2] enables end-to-end the coordinates of k boxes, and the cls layer outputs detector training on shared convolutional features and 2k scores that estimate probability of object or not shows compelling accuracy and speed. object for each proposal4 . The k proposals are param- eterized relative to k reference boxes, which we call 3 FASTER R-CNN 3. “Region” is a generic term and in this paper we only consider Our object detection system, called Faster R-CNN, is rectangular regions, as is common for many methods (e.g., [27], [4], composed of two modules. The first module is a deep [6]). “Objectness” measures membership to a set of object classes fully convolutional network that proposes regions, vs. background. 4. For simplicity we implement the cls layer as a two-class and the second module is the Fast R-CNN detector [2] softmax layer. Alternatively, one may use logistic regression to that uses the proposed regions. The entire system is a produce k scores.

4. 4 person : 0.992 2k scores 4k coordinates k anchor boxes dog : 0.994 horse : 0.993 cls layer reg layer car : 1.000 cat : 0.982 dog : 0.997 person : 0.979 256-d intermediate layer bus : 0.996 boat : 0.970 person : 0.983 person : 0.736 person : 0.983 person : 0.925 person : 0.989 sliding window conv feature map Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios. anchors. An anchor is centered at the sliding window Multi-Scale Anchors as Regression References in question, and is associated with a scale and aspect Our design of anchors presents a novel scheme ratio (Figure 3, left). By default we use 3 scales and for addressing multiple scales (and aspect ratios). As 3 aspect ratios, yielding k = 9 anchors at each sliding shown in Figure 1, there have been two popular ways position. For a convolutional feature map of a size for multi-scale predictions. The first way is based on W × H (typically ∼2,400), there are W Hk anchors in image/feature pyramids, e.g., in DPM [8] and CNN- total. based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep Translation-Invariant Anchors convolutional features [9], [1], [2]) are computed for An important property of our approach is that it each scale (Figure 1(a)). This way is often useful but is translation invariant, both in terms of the anchors is time-consuming. The second way is to use sliding and the functions that compute proposals relative to windows of multiple scales (and/or aspect ratios) on the anchors. If one translates an object in an image, the feature maps. For example, in DPM [8], models the proposal should translate and the same function of different aspect ratios are trained separately using should be able to predict the proposal in either lo- different filter sizes (such as 5×7 and 7×5). If this way cation. This translation-invariant property is guaran- is used to address multiple scales, it can be thought teed by our method5 . As a comparison, the MultiBox of as a “pyramid of filters” (Figure 1(b)). The second method [27] uses k-means to generate 800 anchors, way is usually adopted jointly with the first way [8]. which are not translation invariant. So MultiBox does As a comparison, our anchor-based method is built not guarantee that the same proposal is generated if on a pyramid of anchors, which is more cost-efficient. an object is translated. Our method classifies and regresses bounding boxes The translation-invariant property also reduces the with reference to anchor boxes of multiple scales and model size. MultiBox has a (4 + 1) × 800-dimensional aspect ratios. It only relies on images and feature fully-connected output layer, whereas our method has maps of a single scale, and uses filters (sliding win- a (4 + 2) × 9-dimensional convolutional output layer dows on the feature map) of a single size. We show by in the case of k = 9 anchors. As a result, our output experiments the effects of this scheme for addressing layer has 2.8 × 104 parameters (512 × (4 + 2) × 9 multiple scales and sizes (Table 8). for VGG-16), two orders of magnitude fewer than Because of this multi-scale design based on anchors, MultiBox’s output layer that has 6.1 × 106 parameters we can simply use the convolutional features com- (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox puted on a single-scale image, as is also done by [27]). If considering the feature projection layers, our the Fast R-CNN detector [2]. The design of multi- proposal layers still have an order of magnitude fewer scale anchors is a key component for sharing features parameters than MultiBox6 . We expect our method without extra cost for addressing scales. to have less risk of overfitting on small datasets, like PASCAL VOC. 3.1.2 Loss Function For training RPNs, we assign a binary class label 5. As is the case of FCNs [7], our network is translation invariant (of being an object or not) to each anchor. We as- up to the network’s total stride. sign a positive label to two kinds of anchors: (i) the 6. Considering the feature projection layers, our proposal layers’ anchor/anchors with the highest Intersection-over- parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 106 ; MultiBox’s proposal layers’ parameter count is 7 × 7 × (64 + 96 + Union (IoU) overlap with a ground-truth box, or (ii) an 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 106 . anchor that has an IoU overlap higher than 0.7 with

5. 5 any ground-truth box. Note that a single ground-truth be thought of as bounding-box regression from an box may assign positive labels to multiple anchors. anchor box to a nearby ground-truth box. Usually the second condition is sufficient to determine Nevertheless, our method achieves bounding-box the positive samples; but we still adopt the first regression by a different manner from previous RoI- condition for the reason that in some rare cases the based (Region of Interest) methods [1], [2]. In [1], second condition may find no positive sample. We [2], bounding-box regression is performed on features assign a negative label to a non-positive anchor if its pooled from arbitrarily sized RoIs, and the regression IoU ratio is lower than 0.3 for all ground-truth boxes. weights are shared by all region sizes. In our formula- Anchors that are neither positive nor negative do not tion, the features used for regression are of the same contribute to the training objective. spatial size (3 × 3) on the feature maps. To account With these definitions, we minimize an objective for varying sizes, a set of k bounding-box regressors function following the multi-task loss in Fast R-CNN are learned. Each regressor is responsible for one scale [2]. Our loss function for an image is defined as: and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of 1 L({pi }, {ti }) = Lcls (pi , p∗i ) various sizes even though the features are of a fixed Ncls i size/scale, thanks to the design of anchors. (1) 1 +λ p∗i Lreg (ti , t∗i ). Nreg 3.1.3 Training RPNs i The RPN can be trained end-to-end by back- Here, i is the index of an anchor in a mini-batch and propagation and stochastic gradient descent (SGD) pi is the predicted probability of anchor i being an [35]. We follow the “image-centric” sampling strategy object. The ground-truth label p∗i is 1 if the anchor from [2] to train this network. Each mini-batch arises is positive, and is 0 if the anchor is negative. ti is a from a single image that contains many positive and vector representing the 4 parameterized coordinates negative example anchors. It is possible to optimize of the predicted bounding box, and t∗i is that of the for the loss functions of all anchors, but this will ground-truth box associated with a positive anchor. bias towards negative samples as they are dominate. The classification loss Lcls is log loss over two classes Instead, we randomly sample 256 anchors in an image (object vs. not object). For the regression loss, we use to compute the loss function of a mini-batch, where Lreg (ti , t∗i ) = R(ti − t∗i ) where R is the robust loss the sampled positive and negative anchors have a function (smooth L1 ) defined in [2]. The term p∗i Lreg ratio of up to 1:1. If there are fewer than 128 positive means the regression loss is activated only for positive samples in an image, we pad the mini-batch with anchors (p∗i = 1) and is disabled otherwise (p∗i = 0). negative ones. The outputs of the cls and reg layers consist of {pi } We randomly initialize all new layers by drawing and {ti } respectively. weights from a zero-mean Gaussian distribution with The two terms are normalized by Ncls and Nreg standard deviation 0.01. All other layers (i.e., the and weighted by a balancing parameter λ. In our shared convolutional layers) are initialized by pre- current implementation (as in the released code), the training a model for ImageNet classification [36], as cls term in Eqn.(1) is normalized by the mini-batch is standard practice [5]. We tune all layers of the size (i.e., Ncls = 256) and the reg term is normalized ZF net, and conv3 1 and up for the VGG net to by the number of anchor locations (i.e., Nreg ∼ 2, 400). conserve memory [2]. We use a learning rate of 0.001 By default we set λ = 10, and thus both cls and for 60k mini-batches, and 0.0001 for the next 20k reg terms are roughly equally weighted. We show mini-batches on the PASCAL VOC dataset. We use a by experiments that the results are insensitive to the momentum of 0.9 and a weight decay of 0.0005 [37]. values of λ in a wide range (Table 9). We also note Our implementation uses Caffe [38]. that the normalization as above is not required and could be simplified. For bounding box regression, we adopt the param- 3.2 Sharing Features for RPN and Fast R-CNN eterizations of the 4 coordinates following [5]: Thus far we have described how to train a network for region proposal generation, without considering tx = (x − xa )/wa , ty = (y − ya )/ha , the region-based object detection CNN that will utilize tw = log(w/wa ), th = log(h/ha ), these proposals. For the detection network, we adopt (2) t∗x = (x∗ − xa )/wa , t∗y = (y ∗ − ya )/ha , Fast R-CNN [2]. Next we describe algorithms that t∗w = log(w∗ /wa ), t∗h = log(h∗ /ha ), learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2). where x, y, w, and h denote the box’s center coordi- Both RPN and Fast R-CNN, trained independently, nates and its width and height. Variables x, xa , and will modify their convolutional layers in different x∗ are for the predicted box, anchor box, and ground- ways. We therefore need to develop a technique that truth box respectively (likewise for y, w, h). This can allows for sharing convolutional layers between the

6. 6 Table 1: the learned average proposal size for each anchor using the ZF net (numbers for s = 600). anchor 1282 , 2:1 1282 , 1:1 1282 , 1:2 2562 , 2:1 2562 , 1:1 2562 , 1:2 5122 , 2:1 5122 , 1:1 5122 , 1:2 proposal 188×111 113×114 70×92 416×229 261×284 174×332 768×437 499×501 355×715 two networks, rather than learning two separate net- fix the shared convolutional layers and only fine-tune works. We discuss three ways for training networks the layers unique to RPN. Now the two networks with features shared: share convolutional layers. Finally, keeping the shared (i) Alternating training. In this solution, we first train convolutional layers fixed, we fine-tune the unique RPN, and use the proposals to train Fast R-CNN. layers of Fast R-CNN. As such, both networks share The network tuned by Fast R-CNN is then used to the same convolutional layers and form a unified initialize RPN, and this process is iterated. This is the network. A similar alternating training can be run solution that is used in all experiments in this paper. for more iterations, but we have observed negligible (ii) Approximate joint training. In this solution, the improvements. RPN and Fast R-CNN networks are merged into one 3.3 Implementation Details network during training as in Figure 2. In each SGD iteration, the forward pass generates region propos- We train and test both region proposal and object als which are treated just like fixed, pre-computed detection networks on images of a single scale [1], [2]. proposals when training a Fast R-CNN detector. The We re-scale the images such that their shorter side backward propagation takes place as usual, where for is s = 600 pixels [2]. Multi-scale feature extraction the shared layers the backward propagated signals (using an image pyramid) may improve accuracy but from both the RPN loss and the Fast R-CNN loss does not exhibit a good speed-accuracy trade-off [2]. are combined. This solution is easy to implement. But On the re-scaled images, the total stride for both ZF this solution ignores the derivative w.r.t. the proposal and VGG nets on the last convolutional layer is 16 boxes’ coordinates that are also network responses, pixels, and thus is ∼10 pixels on a typical PASCAL so is approximate. In our experiments, we have em- image before resizing (∼500×375). Even such a large pirically found this solver produces close results, yet stride provides good results, though accuracy may be reduces the training time by about 25-50% comparing further improved with a smaller stride. with alternating training. This solver is included in For anchors, we use 3 scales with box areas of 1282 , our released Python code. 2562 , and 5122 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully cho- (iii) Non-approximate joint training. As discussed sen for a particular dataset, and we provide ablation above, the bounding boxes predicted by RPN are experiments on their effects in the next section. As dis- also functions of the input. The RoI pooling layer cussed, our solution does not need an image pyramid [2] in Fast R-CNN accepts the convolutional features or filter pyramid to predict regions of multiple scales, and also the predicted bounding boxes as input, so saving considerable running time. Figure 3 (right) a theoretically valid backpropagation solver should shows the capability of our method for a wide range also involve gradients w.r.t. the box coordinates. These of scales and aspect ratios. Table 1 shows the learned gradients are ignored in the above approximate joint average proposal size for each anchor using the ZF training. In a non-approximate joint training solution, net. We note that our algorithm allows predictions we need an RoI pooling layer that is differentiable that are larger than the underlying receptive field. w.r.t. the box coordinates. This is a nontrivial problem Such predictions are not impossible—one may still and a solution can be given by an “RoI warping” layer roughly infer the extent of an object if only the middle as developed in [15], which is beyond the scope of this of the object is visible. paper. The anchor boxes that cross image boundaries need 4-Step Alternating Training. In this paper, we adopt to be handled with care. During training, we ignore a pragmatic 4-step training algorithm to learn shared all cross-boundary anchors so they do not contribute features via alternating optimization. In the first step, to the loss. For a typical 1000 × 600 image, there we train the RPN as described in Section 3.1.3. This will be roughly 20000 (≈ 60 × 40 × 9) anchors in network is initialized with an ImageNet-pre-trained total. With the cross-boundary anchors ignored, there model and fine-tuned end-to-end for the region pro- are about 6000 anchors per image for training. If the posal task. In the second step, we train a separate boundary-crossing outliers are not ignored in training, detection network by Fast R-CNN using the proposals they introduce large, difficult to correct error terms in generated by the step-1 RPN. This detection net- the objective, and training does not converge. During work is also initialized by the ImageNet-pre-trained testing, however, we still apply the fully convolutional model. At this point the two networks do not share RPN to the entire image. This may generate cross- convolutional layers. In the third step, we use the boundary proposal boxes, which we clip to the image detector network to initialize RPN training, but we boundary.

7. 7 Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are Fast R-CNN with ZF, but using various proposal methods for training and testing. train-time region proposals test-time region proposals method # boxes method # proposals mAP (%) SS 2000 SS 2000 58.7 EB 2000 EB 2000 58.6 RPN+ZF, shared 2000 RPN+ZF, shared 300 59.9 ablation experiments follow below RPN+ZF, unshared 2000 RPN+ZF, unshared 300 58.7 SS 2000 RPN+ZF 100 55.1 SS 2000 RPN+ZF 300 56.8 SS 2000 RPN+ZF 1000 56.3 SS 2000 RPN+ZF (no NMS) 6000 55.2 SS 2000 RPN+ZF (no cls) 100 44.6 SS 2000 RPN+ZF (no cls) 300 51.4 SS 2000 RPN+ZF (no cls) 1000 55.8 SS 2000 RPN+ZF (no reg) 300 52.1 SS 2000 RPN+ZF (no reg) 1000 51.3 SS 2000 RPN+VGG 300 59.2 Some RPN proposals highly overlap with each IoU. SS has an mAP of 58.7% and EB has an mAP other. To reduce redundancy, we adopt non-maximum of 58.6% under the Fast R-CNN framework. RPN suppression (NMS) on the proposal regions based on with Fast R-CNN achieves competitive results, with their cls scores. We fix the IoU threshold for NMS an mAP of 59.9% while using up to 300 proposals8 . at 0.7, which leaves us about 2000 proposal regions Using RPN yields a much faster detection system than per image. As we will show, NMS does not harm the using either SS or EB because of shared convolutional ultimate detection accuracy, but substantially reduces computations; the fewer proposals also reduce the the number of proposals. After NMS, we use the region-wise fully-connected layers’ cost (Table 5). top-N ranked proposal regions for detection. In the Ablation Experiments on RPN. To investigate the be- following, we train Fast R-CNN using 2000 RPN pro- havior of RPNs as a proposal method, we conducted posals, but evaluate different numbers of proposals at several ablation studies. First, we show the effect of test-time. sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop 4 E XPERIMENTS after the second step in the 4-step training process. 4.1 Experiments on PASCAL VOC Using separate networks reduces the result slightly to We comprehensively evaluate our method on the 58.7% (RPN+ZF, unshared, Table 2). We observe that PASCAL VOC 2007 detection benchmark [11]. This this is because in the third step when the detector- dataset consists of about 5k trainval images and 5k tuned features are used to fine-tune the RPN, the test images over 20 object categories. We also provide proposal quality is improved. results on the PASCAL VOC 2012 benchmark for a Next, we disentangle the RPN’s influence on train- few models. For the ImageNet pre-trained network, ing the Fast R-CNN detection network. For this pur- we use the “fast” version of ZF net [32] that has pose, we train a Fast R-CNN model by using the 5 convolutional layers and 3 fully-connected layers, 2000 SS proposals and ZF net. We fix this detector and the public VGG-16 model7 [3] that has 13 con- and evaluate the detection mAP by changing the volutional layers and 3 fully-connected layers. We proposal regions used at test-time. In these ablation primarily evaluate detection mean Average Precision experiments, the RPN does not share features with (mAP), because this is the actual metric for object the detector. detection (rather than focusing on object proposal Replacing SS with 300 RPN proposals at test-time proxy metrics). leads to an mAP of 56.8%. The loss in mAP is because Table 2 (top) shows Fast R-CNN results when of the inconsistency between the training/testing pro- trained and tested using various region proposal posals. This result serves as the baseline for the fol- methods. These results use the ZF net. For Selective lowing comparisons. Search (SS) [4], we generate about 2000 proposals by Somewhat surprisingly, the RPN still leads to a the “fast” mode. For EdgeBoxes (EB) [6], we generate competitive result (55.1%) when using the top-ranked the proposals by the default EB setting tuned for 0.7 8. For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce fewer proposals after 7. www.robots.ox.ac.uk/∼vgg/research/very deep/ NMS, and thus the average number of proposals is smaller.

8. 8 Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. † : this number was reported in [2]; using the repository provided by this paper, this result is higher (68.1). method # proposals data mAP (%) SS 2000 07 66.9† SS 2000 07+12 70.0 RPN+VGG, unshared 300 07 68.5 RPN+VGG, shared 300 07 69.9 RPN+VGG, shared 300 07+12 73.2 RPN+VGG, shared 300 COCO+07+12 78.8 Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. † : http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡ : http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. § : http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html. method # proposals data mAP (%) SS 2000 12 65.7 SS 2000 07++12 68.4 RPN+VGG, shared† 300 12 67.0 RPN+VGG, shared‡ 300 07++12 70.4 RPN+VGG, shared§ 300 COCO+07++12 75.9 Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. “Region-wise” includes NMS, pooling, fully-connected, and softmax layers. See our released code for the profiling of running time. model system conv proposal region-wise total rate VGG SS + Fast R-CNN 146 1510 174 1830 0.5 fps VGG RPN + Fast R-CNN 141 10 47 198 5 fps ZF RPN + Fast R-CNN 31 3 25 59 17 fps 100 proposals at test-time, indicating that the top- (using RPN+ZF) to 59.2% (using RPN+VGG). This is a ranked RPN proposals are accurate. On the other promising result, because it suggests that the proposal extreme, using the top-ranked 6000 RPN proposals quality of RPN+VGG is better than that of RPN+ZF. (without NMS) has a comparable mAP (55.2%), sug- Because proposals of RPN+ZF are competitive with gesting NMS does not harm the detection mAP and SS (both are 58.7% when consistently used for training may reduce false alarms. and testing), we may expect RPN+VGG to be better Next, we separately investigate the roles of RPN’s than SS. The following experiments justify this hy- cls and reg outputs by turning off either of them pothesis. at test-time. When the cls layer is removed at test- Performance of VGG-16. Table 3 shows the results time (thus no NMS/ranking is used), we randomly of VGG-16 for both proposal and detection. Using sample N proposals from the unscored regions. The RPN+VGG, the result is 68.5% for unshared features, mAP is nearly unchanged with N = 1000 (55.8%), but slightly higher than the SS baseline. As shown above, degrades considerably to 44.6% when N = 100. This this is because the proposals generated by RPN+VGG shows that the cls scores account for the accuracy of are more accurate than SS. Unlike SS that is pre- the highest ranked proposals. defined, the RPN is actively trained and benefits from On the other hand, when the reg layer is removed better networks. For the feature-shared variant, the at test-time (so the proposals become anchor boxes), result is 69.9%—better than the strong SS baseline, yet the mAP drops to 52.1%. This suggests that the high- with nearly cost-free proposals. We further train the quality proposals are mainly due to the regressed box RPN and detection network on the union set of PAS- bounds. The anchor boxes, though having multiple CAL VOC 2007 trainval and 2012 trainval. The mAP scales and aspect ratios, are not sufficient for accurate is 73.2%. Figure 5 shows some results on the PASCAL detection. VOC 2007 test set. On the PASCAL VOC 2012 test set We also evaluate the effects of more powerful net- (Table 4), our method has an mAP of 70.4% trained works on the proposal quality of RPN alone. We use on the union set of VOC 2007 trainval+test and VOC VGG-16 to train the RPN, and still use the above 2012 trainval. Table 6 and Table 7 show the detailed detector of SS+ZF. The mAP improves from 56.8% numbers.

9. 9 Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. RPN∗ denotes the unsharing feature version. method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv SS 2000 07 66.9 74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8 SS 2000 07+12 70.0 77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4 RPN∗ 300 07 68.5 74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4 RPN 300 07 69.9 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6 RPN 300 07+12 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6 RPN 300 COCO+07+12 78.8 84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9 Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv SS 2000 12 65.7 80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7 SS 2000 07++12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2 RPN 300 12 67.0 82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9 RPN 300 07++12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5 RPN 300 COCO+07++12 75.9 87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2 Table 8: Detection results of Faster R-CNN on PAS- 3 scales and 3 aspect ratios (69.9% mAP in Table 8). CAL VOC 2007 test set using different settings of If using just one anchor at each position, the mAP anchors. The network is VGG-16. The training data drops by a considerable margin of 3-4%. The mAP is VOC 2007 trainval. The default setting of using 3 is higher if using 3 scales (with 1 aspect ratio) or 3 scales and 3 aspect ratios (69.9%) is the same as that aspect ratios (with 1 scale), demonstrating that using in Table 3. anchors of multiple sizes as the regression references settings anchor scales aspect ratios mAP (%) is an effective solution. Using just 3 scales with 1 1282 1:1 65.8 aspect ratio (69.8%) is as good as using 3 scales with 1 scale, 1 ratio 2562 1:1 66.7 3 aspect ratios on this dataset, suggesting that scales 1282 {2:1, 1:1, 1:2} 68.8 1 scale, 3 ratios and aspect ratios are not disentangled dimensions for 2562 {2:1, 1:1, 1:2} 67.9 the detection accuracy. But we still adopt these two 3 scales, 1 ratio {128 , 2562 , 5122 } 2 1:1 69.8 dimensions in our designs to keep our system flexible. 3 scales, 3 ratios {1282 , 2562 , 5122 } {2:1, 1:1, 1:2} 69.9 In Table 9 we compare different values of λ in Equa- tion (1). By default we use λ = 10 which makes the Table 9: Detection results of Faster R-CNN on PAS- two terms in Equation (1) roughly equally weighted CAL VOC 2007 test set using different values of λ after normalization. Table 9 shows that our result is in Equation (1). The network is VGG-16. The training impacted just marginally (by ∼ 1%) when λ is within data is VOC 2007 trainval. The default setting of using a scale of about two orders of magnitude (1 to 100). λ = 10 (69.9%) is the same as that in Table 3. This demonstrates that the result is insensitive to λ in λ 0.1 1 10 100 a wide range. mAP (%) 67.2 68.9 69.9 69.1 Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground- truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the In Table 5 we summarize the running time of the ultimate detection accuracy. It is more appropriate to entire object detection system. SS takes 1-2 seconds use this metric to diagnose the proposal method than depending on content (on average about 1.5s), and to evaluate it. Fast R-CNN with VGG-16 takes 320ms on 2000 SS In Figure 4, we show the results of using 300, 1000, proposals (or 223ms if using SVD on fully-connected and 2000 proposals. We compare with SS and EB, and layers [2]). Our system with VGG-16 takes in total the N proposals are the top-N ranked ones based on 198ms for both proposal and detection. With the con- the confidence generated by these methods. The plots volutional features shared, the RPN alone only takes show that the RPN method behaves gracefully when 10ms computing the additional layers. Our region- the number of proposals drops from 2000 to 300. This wise computation is also lower, thanks to fewer pro- explains why the RPN has a good ultimate detection posals (300 per image). Our system has a frame-rate mAP when using as few as 300 proposals. As we of 17 fps with the ZF net. analyzed before, this property is mainly attributed to Sensitivities to Hyper-parameters. In Table 8 we the cls term of the RPN. The recall of SS and EB drops investigate the settings of anchors. By default we use more quickly than RPN when the proposals are fewer.

10. 10 ϯϬϬ ƉƌŽƉŽƐĂůƐ ϭϬϬϬ ƉƌŽƉŽƐĂůƐ ϮϬϬϬ ƉƌŽƉŽƐĂůƐ ϭ ϭ ϭ Ϭ͘ϴ Ϭ͘ϴ Ϭ͘ϴ Ϭ͘ϲ Ϭ͘ϲ Ϭ͘ϲ ZĞĐĂůů ^^ ^^ ^^ Ϭ͘ϰ Ϭ͘ϰ Ϭ͘ϰ    Ϭ͘Ϯ ZWE & Ϭ͘Ϯ ZWE & Ϭ͘Ϯ ZWE & ZWE s'' ZWE s'' ZWE s'' Ϭ Ϭ Ϭ Ϭ͘ϱ Ϭ͘ϲ Ϭ͘ϳ Ϭ͘ϴ Ϭ͘ϵ ϭ Ϭ͘ϱ Ϭ͘ϲ Ϭ͘ϳ Ϭ͘ϴ Ϭ͘ϵ ϭ Ϭ͘ϱ Ϭ͘ϲ Ϭ͘ϳ Ϭ͘ϴ Ϭ͘ϵ ϭ /Žh /Žh /Žh Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set. Table 10: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features. proposals detector mAP (%) Two-Stage RPN + ZF, unshared 300 Fast R-CNN + ZF, 1 scale 58.7 One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 1 scale 53.8 One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 5 scales 53.9 One-Stage Detection vs. Two-Stage Proposal + De- region proposals with sliding windows leads to ∼6% tection. The OverFeat paper [9] proposes a detection degradation in both papers. We also note that the one- method that uses regressors and classifiers on sliding stage system is slower as it has considerably more windows over convolutional feature maps. OverFeat proposals to process. is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic pro- posals and class-specific detections. In OverFeat, the 4.2 Experiments on MS COCO region-wise features come from a sliding window of We present more results on the Microsoft COCO one aspect ratio over a scale pyramid. These features object detection dataset [12]. This dataset involves 80 are used to simultaneously determine the location and object categories. We experiment with the 80k images category of objects. In RPN, the features are from on the training set, 40k images on the validation set, square (3×3) sliding windows and predict proposals and 20k images on the test-dev set. We evaluate the relative to anchors with different scales and aspect mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO’s ratios. Though both methods use sliding windows, the standard metric, simply denoted as mAP@[.5, .95]) region proposal task is only the first stage of Faster R- and mAP@0.5 (PASCAL VOC’s metric). CNN—the downstream Fast R-CNN detector attends There are a few minor changes of our system made to the proposals to refine them. In the second stage of for this dataset. We train our models on an 8-GPU our cascade, the region-wise features are adaptively implementation, and the effective mini-batch size be- pooled [1], [2] from proposal boxes that more faith- comes 8 for RPN (1 per GPU) and 16 for Fast R-CNN fully cover the features of the regions. We believe (2 per GPU). The RPN step and Fast R-CNN step are these features lead to more accurate detections. both trained for 240k iterations with a learning rate To compare the one-stage and two-stage systems, of 0.003 and then for 80k iterations with 0.0003. We we emulate the OverFeat system (and thus also circum- modify the learning rates (starting with 0.003 instead vent other differences of implementation details) by of 0.001) because the mini-batch size is changed. For one-stage Fast R-CNN. In this system, the “proposals” the anchors, we use 3 aspect ratios and 4 scales are dense sliding windows of 3 scales (128, 256, 512) (adding 642 ), mainly motivated by handling small and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is objects on this dataset. In addition, in our Fast R-CNN trained to predict class-specific scores and regress box step, the negative samples are defined as those with locations from these sliding windows. Because the a maximum IoU with ground truth in the interval of OverFeat system adopts an image pyramid, we also [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note evaluate using convolutional features extracted from that in the SPPnet system [1], the negative samples 5 scales. We use those 5 scales as in [1], [2]. in [0.1, 0.5) are used for network fine-tuning, but the Table 10 compares the two-stage system and two negative samples in [0, 0.5) are still visited in the SVM variants of the one-stage system. Using the ZF model, step with hard-negative mining. But the Fast R-CNN the one-stage system has an mAP of 53.9%. This is system [2] abandons the SVM step, so the negative lower than the two-stage system (58.7%) by 4.8%. samples in [0, 0.1) are never visited. Including these This experiment justifies the effectiveness of cascaded [0, 0.1) samples improves mAP@0.5 on the COCO region proposals and object detection. Similar obser- dataset for both Fast R-CNN and Faster R-CNN sys- vations are reported in [2], [39], where replacing SS tems (but the impact is negligible on PASCAL VOC).

11. 11 Table 11: Object detection results (%) on the MS COCO dataset. The model is VGG-16. COCO val COCO test-dev method proposals training data mAP@.5 mAP@[.5, .95] mAP@.5 mAP@[.5, .95] Fast R-CNN [2] SS, 2000 COCO train - - 35.9 19.7 Fast R-CNN [impl. in this paper] SS, 2000 COCO train 38.6 18.9 39.3 19.3 Faster R-CNN RPN, 300 COCO train 41.5 21.2 42.1 21.5 Faster R-CNN RPN, 300 COCO trainval - - 42.7 21.9 The rest of the implementation details are the same Table 12: Detection mAP (%) of Faster R-CNN on as on PASCAL VOC. In particular, we keep using PASCAL VOC 2007 test set and 2012 test set us- 300 proposals and single-scale (s = 600) testing. The ing different training data. The model is VGG-16. testing time is still about 200ms per image on the “COCO” denotes that the COCO trainval set is used COCO dataset. for training. See also Table 6 and Table 7. training data 2007 test 2012 test In Table 11 we first report the results of the Fast VOC07 69.9 67.0 R-CNN system [2] using the implementation in this VOC07+12 73.2 - paper. Our Fast R-CNN baseline has 39.3% mAP@0.5 VOC07++12 - 70.4 on the test-dev set, higher than that reported in [2]. COCO (no VOC) 76.1 73.0 We conjecture that the reason for this gap is mainly COCO+VOC07+12 78.8 - due to the definition of the negative samples and also COCO+VOC07++12 - 75.9 the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable. 4.3 From MS COCO to PASCAL VOC Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has Large-scale data is of crucial importance for improv- 42.1% mAP@0.5 and 21.5% mAP@[.5, .95] on the ing deep neural networks. Next, we investigate how COCO test-dev set. This is 2.8% higher for mAP@0.5 the MS COCO dataset can help with the detection and 2.2% higher for mAP@[.5, .95] than the Fast R- performance on PASCAL VOC. CNN counterpart under the same protocol (Table 11). As a simple baseline, we directly evaluate the This indicates that RPN performs excellent for im- COCO detection model on the PASCAL VOC dataset, proving the localization accuracy at higher IoU thresh- without fine-tuning on any PASCAL VOC data. This olds. Using the COCO trainval set to train, Faster R- evaluation is possible because the categories on CNN has 42.7% mAP@0.5 and 21.9% mAP@[.5, .95] on COCO are a superset of those on PASCAL VOC. The the COCO test-dev set. Figure 6 shows some results categories that are exclusive on COCO are ignored in on the MS COCO test-dev set. this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP Faster R-CNN in ILSVRC & COCO 2015 compe- under this setting is 76.1% on the PASCAL VOC 2007 titions We have demonstrated that Faster R-CNN test set (Table 12). This result is better than that trained benefits more from better features, thanks to the fact on VOC07+12 (73.2%) by a good margin, even though that the RPN completely learns to propose regions by the PASCAL VOC data are not exploited. neural networks. This observation is still valid even Then we fine-tune the COCO detection model on when one increases the depth substantially to over the VOC dataset. In this experiment, the COCO model 100 layers [18]. Only by replacing VGG-16 with a 101- is in place of the ImageNet-pre-trained model (that layer residual net (ResNet-101) [18], the Faster R-CNN is used to initialize the network weights), and the system increases the mAP from 41.5%/21.2% (VGG- Faster R-CNN system is fine-tuned as described in 16) to 48.4%/27.2% (ResNet-101) on the COCO val Section 3.2. Doing so leads to 78.8% mAP on the set. With other improvements orthogonal to Faster R- PASCAL VOC 2007 test set. The extra data from CNN, He et al. [18] obtained a single-model result of the COCO set increases the mAP by 5.6%. Table 6 55.7%/34.9% and an ensemble result of 59.0%/37.4% shows that the model trained on COCO+VOC has on the COCO test-dev set, which won the 1st place the best AP for every individual category on PASCAL in the COCO 2015 object detection competition. The VOC 2007. Similar improvements are observed on the same system [18] also won the 1st place in the ILSVRC PASCAL VOC 2012 test set (Table 12 and Table 7). We 2015 object detection competition, surpassing the sec- note that the test-time speed of obtaining these strong ond place by absolute 8.5%. RPN is also a building results is still about 200ms per image. block of the 1st-place winning entries in ILSVRC 2015 localization and COCO 2015 segmentation competi- 5 C ONCLUSION tions, for which the details are available in [18] and We have presented RPNs for efficient and accurate [15] respectively. region proposal generation. By sharing convolutional

12. 12 person : 0.918 cow : 0.995 bird : 0.902 person : 0.988 person : 0.992 car : 0.745 .745 person : 0.797 bird : 0.978 car : 0.955 55 horse : 0.991 bird : 0.972 cow : 0.998 bird : 0.941 bottle : 0.726 person : 0.964 person : 0.988 p pers person : 0.986 86 car : 0.999 personn person : 0.993 0 993 : 0.959 person : 0.976 person : 0.929 person : 0.994 person : 0.991 car : 0.997 car : 0.980 dog : 0.981 cow : 0.979 person : 0.998 person : 0.961 cow : 0.974 person : 0.958 cow : 0.979 bus : 0.999 person : 0.960 cow : 0.892 cow : 0.985 person : 0.985 person : 0.995 person : 0.996 per person : 0.757 person : 0.994 dog : 0.697 cat : 0.998 person : 0.917 boat : 0.671 car : 1.000 boat : 0.895 boat : 0.749 boat : 0.877 person : 0.988 person : 0.995 person : 0.994 bicycle b bicyc 4person e :: 0.981 0.987 0 987 person : 0.930 person : 0.940 person 940 : 0.893 bicycle : 0.972 bicycle : 0.977 77 boat : 0.992 person : 0.962 dog : 0.987 pottedplant : 0.951 bottle : 0.851 bottle : 0 0.962 962 boat : 0.693 diningtable : 0.791 boat : 0.846 person : 0.948 person : 0.972 person : 0.919 pottedplant : 0.728 car : 1.000 car : 0.880 car : 0.981 car : 0.982 chair : 0.630 boat : 0.995 boat : 0.948 diningtable : 0.862 bottle : 0.826 boat : 0.692 boat : 0 0.808 808 person : 0.975 aeroplane : 0.992 bird : 0.998 aeroplane : 0.986 sheep : 0.970 bird : 0.980 bird : 0.806 person : 0.670 horse : 0.984 aeroplane : 0.998 pottedplant : 0.820 chair : 0.984 984 diningtable : 0.997 pottedplant : 0.993 chair : 0.978 chair : 0.962 chair : 0.976 pottedplant : 0.715 car : 0.907 907 person : 0.993 person : 0.987 pottedplant : 0.940 pottedplant : 0.869 tvmonitor : 0.945 person : 0.983 aeroplane : 0.978 bird : 0.997 tvmonitor : 0.993 chair : 0.723 person : 0.968 chair : 0.982 tvmonitor : 0.993 person : 0.959 bottle e : 0.789 person : 0.988 diningtable : 0.903 bottle : 0 bot 0.858 chair : 0.852 bottle : 0.616 b bottle :person 0 0.903 903 : 0.897 person : 0.870 bottle : 0.884 bird : 0.727 Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. The running time for obtaining these results is 198ms per image, including all steps. features with the down-stream detection network, the R EFERENCES region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detec- [1] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling tion system to run at near real-time frame rates. The in deep convolutional networks for visual recognition,” in learned RPN also improves region proposal quality European Conference on Computer Vision (ECCV), 2014. [2] R. Girshick, “Fast R-CNN,” in IEEE International Conference on and thus the overall object detection accuracy. Computer Vision (ICCV), 2015. [3] K. Simonyan and A. Zisserman, “Very deep convolutional

13. 13 person person son on : 0 0.975 975 traffic light : 0.802 person : 0.941 .941 4 person : 0.673 person : 0.928 person nperson : 0. 0.958 958: 0.823 airplane : 0.997 person : 0.759 p person : 0.766 backpack : 0.756 person : 0.772 0person 976 : 0.939 person : 0.976 0 939 person : 0.842 0.84 person : 0.841 person : 0.867 umbrella : 0.824 person : 0.897 car : 0.957 person on : 0 0.950 50 handbag : 0.848 person : 0.805 clock : 0.986 person : 0.950 p person : 0.931 person : 0.970 clock : 0.981 person : 0.916 motorcycle : 0.713 dog : 0.996 bicycle : 0.891 dog : 0.691 bicycle : 0.639 person : 0.996 person : 0.800 motorcycle : 0.827 person : 0.808 pizza : 0.985 person : 0.998 dining table : 0.956 pizza : 0.938 bed : 0.999 pizza : 0.995 pizza : 0.982 clock : 0.982 skis : 0.919 bottle : 0.627 bowl : 0.759 giraffe : 0.989 giraffe : 0.993 giraffe : 0.988 person : 0.999 broccoli : 0.953 boat : 0.992 person : 0.934 surfboard : 0.979 umbrella : 0.885 person : 0.691 p person : 0.716 person : 0.940 person : 0.854 person : 0.927 927 person : 0.665 person : 0.692 person : 0.618 person : 0.825 5person : 0.813 person : 0.864 teddy bear : 0.999 bus : 0.999 teddy bear : 0.738 teddy bear : 0.802 potted plant : 0.769 teddy bear : 0.890 person person erson : 0.869 : 0.970 bowl : 0.602 6 sink : 0.938 sink : 0.976 sink : 0.994 toilet : 0.921 sink : 0.992 sink : 0.969 book : 0.611 tv : 0.964 bottle : 0.768 traffic light : 0.713 laptop : 0.986 traffic light : 0.869 couch : 0.627 train : 0.965 couch : 0.991 mouse : 0.871 m boat : 0.613 boat : 0.746 couch : 0.719 tv : 0.959 boat : 0.758 keyboard : 0.956 dining table : 0.637 mouse : 0.677 chair : 0.631 bench : 0.971 chair : 0.644 person : 0.986 cup : 0.720 frisbee : 0.998 person : 0.723 cup : 0.931 dining table : 0.941 cup : 0.986 bird : 0.968 dog : 0.966 bowl : 0.958 zebra : 0.996 zebra : 0.970 970 zebra : 0.848 zebra : 0.993 sandwich : 0.629 bird : 0.987 bird : 0.894 person :tv 0 : 0.711 0.792 792 person : 0.917 refrigerator : 0.699 person : 0.993 bottle : 0.982 laptop : 0.973 tennis :racket person perso 0.999 : 0.960 horse : 0.990 bird : 0.746 oven : 0.655 bird : 0.956 keyboard : 0.638 bird : 0.906 keyboard : 0.615 mouse : 0.981 dining table : 0.888 cup : 0.990 car : 0.816 toothbrush : 0.668 person : 0.984 refrigerator : 0.631 pizza : 0.919 kite : 0.934 clock : 0.988 bowl : 0.744 bowl : 0.816 bowl : 0.710 person : 0.998 bowl : 0.847 cup : 0.807 pizza : 0.965 chair : 0.772 oven : 0.969 dining table : 0.618 Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN system. The model is VGG-16 and the training data is COCO trainval (42.7% mAP@0.5 on the test-dev set). Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. For each image, one color represents one object category in that image. networks for large-scale image recognition,” in International [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional Conference on Learning Representations (ICLR), 2015. networks for semantic segmentation,” in IEEE Conference on [4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeul- Computer Vision and Pattern Recognition (CVPR), 2015. ders, “Selective search for object recognition,” International [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra- Journal of Computer Vision (IJCV), 2013. manan, “Object detection with discriminatively trained part- [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature based models,” IEEE Transactions on Pattern Analysis and Ma- hierarchies for accurate object detection and semantic seg- chine Intelligence (TPAMI), 2010. mentation,” in IEEE Conference on Computer Vision and Pattern [9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Recognition (CVPR), 2014. and Y. LeCun, “Overfeat: Integrated recognition, localization [6] C. L. Zitnick and P. Doll´ar, “Edge boxes: Locating object and detection using convolutional networks,” in International proposals from edges,” in European Conference on Computer Conference on Learning Representations (ICLR), 2014. Vision (ECCV), 2014. [10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards

14. 14 real-time object detection with region proposal networks,” in [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Neural Information Processing Systems (NIPS), 2015. Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and and L. Fei-Fei, “ImageNet Large Scale Visual Recognition A. Zisserman, “The PASCAL Visual Object Classes Challenge Challenge,” in International Journal of Computer Vision (IJCV), 2007 (VOC2007) Results,” 2007. 2015. [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- [37] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classi- manan, P. Doll´ar, and C. L. Zitnick, “Microsoft COCO: Com- fication with deep convolutional neural networks,” in Neural mon Objects in Context,” in European Conference on Computer Information Processing Systems (NIPS), 2012. Vision (ECCV), 2014. [38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- [13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object shick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional detection in rgb-d images,” arXiv:1511.02300, 2015. architecture for fast feature embedding,” arXiv:1408.5093, 2014. [14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based [39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine model for object detection and semantic part localization,” Vision Conference (BMVC), 2015. arXiv:1511.07131, 2015. [15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmenta- tion via multi-task network cascades,” arXiv:1512.04412, 2015. [16] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” arXiv:1511.07571, 2015. [17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human cu- ration and convnets: Powering item-to-item recommendations on pinterest,” arXiv:1511.04003, 2015. [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015. [19] J. Hosang, R. Benenson, and B. Schiele, “How good are de- tection proposals, really?” in British Machine Vision Conference (BMVC), 2014. [20] J. Hosang, R. Benenson, P. Doll´ar, and B. Schiele, “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015. [21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, “Object-Proposal Evaluation Protocol is ’Gameable’,” arXiv: 1505.05836, 2015. [22] J. Carreira and C. Sminchisescu, “CPMC: Automatic ob- ject segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012. [23] P. Arbel´aez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. [24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the object- ness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012. [25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013. [26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2014. [27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015. [28] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment object candidates,” in Neural Information Processing Systems (NIPS), 2015. [29] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Ob- ject detection networks on convolutional feature maps,” arXiv:1504.06066, 2015. [31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015. [32] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014. [33] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning (ICML), 2010. [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convo- lutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, 1989.