R-FCN: Object Detection via Region-based Fully Convolutional Networks

We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20× faster than the Faster R-CNN counterpart. Code is made publicly available at:https://github.com/daijifeng001/r-fcn.

1. R-FCN: Object Detection via Region-based Fully Convolutional Networks Jifeng Dai Yi Li∗ Kaiming He Jian Sun Microsoft Research Tsinghua University Microsoft Research Microsoft Research arXiv:1605.06409v2 [cs.CV] 21 Jun 2016 Abstract We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20× faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn. 1 Introduction A prevalent family [8, 6, 18] of deep networks for object detection can be divided into two subnetworks by the Region-of-Interest (RoI) pooling layer [6]: (i) a shared, “fully convolutional” subnetwork independent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This decomposition [8] was historically resulted from the pioneering classification architectures, such as AlexNet [10] and VGG Nets [23], that consist of two subnetworks by design — a convolutional subnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus the (last) spatial pooling layer in image classification networks is naturally turned into the RoI pooling layer in object detection networks [8, 6, 18]. But recent state-of-the-art image classification networks such as Residual Nets (ResNets) [9] and GoogLeNets [24, 26] are by design fully convolutional2 . By analogy, it appears natural to use all convolutional layers to construct the shared, convolutional subnetwork in the object detection architecture, leaving the RoI-wise subnetwork no hidden layer. However, as empirically investigated in this work, this naïve solution turns out to have considerably inferior detection accuracy that does not match the network’s superior classification accuracy. To remedy this issue, in the ResNet paper [9] the RoI pooling layer of the Faster R-CNN detector [18] is unnaturally inserted between two sets of convolutional layers — this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation. We argue that the aforementioned unnatural design is caused by a dilemma of increasing translation invariance for image classification vs. respecting translation variance for object detection. On one hand, the image-level classification task favors translation invariance — shift of an object inside an image should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation- invariant as possible are preferable as evidenced by the leading results on ImageNet classification ∗ This work was done when Yi Li was an intern at Microsoft Research. 2 Only the last layer is fully-connected, which is removed and replaced when fine-tuning for object detection.

2. top-left top-center bottom-right …... k k2(C+1)-d conv conv vote softmax RoI pool k C+1 image feature maps C+1 C+1 position-sensitive k2(C+1) score maps Figure 1: Key idea of R-FCN for object detection. In this illustration, there are k × k = 3 × 3 position-sensitive score maps generated by a fully convolutional network. For each of the k × k bins in an RoI, pooling is only performed on one of the k 2 maps (marked by different colors). Table 1: Methodologies of region-based detectors using ResNet-101 [9]. R-CNN [7] Faster R-CNN [19, 9] R-FCN [ours] depth of shared convolutional subnetwork 0 91 101 depth of RoI-wise subnetwork 101 10 0 [9, 24, 26]. On the other hand, the object detection task needs localization representations that are translation-variant to an extent. For example, translation of an object inside a candidate box should produce meaningful responses for describing how good the candidate box overlaps the object. We hypothesize that deeper convolutional layers in an image classification network are less sensitive to translation. To address this dilemma, the ResNet paper’s detection pipeline [9] inserts the RoI pooling layer into convolutions — this region-specific operation breaks down translation invariance, and the post-RoI convolutional layers are no longer translation-invariant when evaluated across different regions. However, this design sacrifices training and testing efficiency since it introduces a considerable number of region-wise layers (Table 1). In this paper, we develop a framework called Region-based Fully Convolutional Network (R-FCN) for object detection. Our network consists of shared, fully convolutional architectures as is the case of FCN [15]. To incorporate translation variance into FCN, we construct a set of position-sensitive score maps by using a bank of specialized convolutional layers as the FCN output. Each of these score maps encodes the position information with respect to a relative spatial position (e.g., “to the left of an object”). On top of this FCN, we append a position-sensitive RoI pooling layer that shepherds information from these score maps, with no weight (convolutional/fc) layers following. The entire architecture is learned end-to-end. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection. Figure 1 illustrates the key idea and Table 1 compares the methodologies among region-based detectors. Using the 101-layer Residual Net (ResNet-101) [9] as the backbone, our R-FCN yields competitive results of 83.6% mAP on the PASCAL VOC 2007 set and 82.0% the 2012 set. Meanwhile, our results are achieved at a test-time speed of 170ms per image using ResNet-101, which is 2.5× to 20× faster than the Faster R-CNN + ResNet-101 counterpart in [9]. These experiments demonstrate that our method manages to address the dilemma between invariance/variance on translation, and fully convolu- tional image-level classifiers such as ResNets can be effectively converted to fully convolutional object detectors. Code is made publicly available at: https://github.com/daijifeng001/r-fcn. 2 Our approach Overview. Following R-CNN [7], we adopt the popular two-stage object detection strategy [7, 8, 6, 18, 1, 22] that consists of: (i) region proposal, and (ii) region classification. Although methods that do not rely on region proposal do exist (e.g., [17, 14]), region-based systems still possess leading 2

3. ZWE RoIs conv ƉĞƌͲZŽ/ conv conv RoI vote pool feature maps Figure 2: Overall architecture of R-FCN. A Region Proposal Network (RPN) [18] proposes candidate RoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible. accuracy on several benchmarks [5, 13, 20]. We extract candidate regions by the Region Proposal Network (RPN) [18], which is a fully convolutional architecture in itself. Following [18], we share the features between RPN and R-FCN. Figure 2 shows an overview of the system. Given the proposal regions (RoIs), the R-FCN architecture is designed to classify the RoIs into object categories and background. In R-FCN, all learnable weight layers are convolutional and are computed on the entire image. The last convolutional layer produces a bank of k 2 position-sensitive score maps for each category, and thus has a k 2 (C + 1)-channel output layer with C object categories (+1 for background). The bank of k 2 score maps correspond to a k × k spatial grid describing relative positions. For example, with k × k = 3 × 3, the 9 score maps encode the cases of {top-left, top-center, top-right, ..., bottom-right} of an object category. R-FCN ends with a position-sensitive RoI pooling layer. This layer aggregates the outputs of the last convolutional layer and generates scores for each RoI. Unlike [8, 6], our position-sensitive RoI layer conducts selective pooling, and each of the k × k bin aggregates responses from only one score map out of the bank of k × k score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps. Figure 1 illustrates this idea. Figure 3 and 4 visualize an example. The details are introduced as follows. Backbone architecture. The incarnation of R-FCN in this paper is based on ResNet-101 [9], though other networks [10, 23] are applicable. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc layer and only use the convolutional layers to compute feature maps. We use the ResNet-101 released by the authors of [9], pre-trained on ImageNet [20]. The last convolutional block in ResNet-101 is 2048-d, and we attach a randomly initialized 1024-d 1×1 convolutional layer for reducing dimension (to be precise, this increases the depth in Table 1 by 1). Then we apply the k 2 (C + 1)-channel convolutional layer to generate score maps, as introduced next. Position-sensitive score maps & Position-sensitive RoI pooling. To explicitly encode position information into each RoI, we divide each RoI rectangle into k × k bins by a regular grid. For an RoI rectangle of a size w × h, a bin is of a size ≈ w h k × k [8, 6]. In our method, the last convolutional layer 2 is constructed to produce k score maps for each category. Inside the (i, j)-th bin (0 ≤ i, j ≤ k − 1), we define a position-sensitive RoI pooling operation that pools only over the (i, j)-th score map: rc (i, j | Θ) = zi,j,c (x + x0 , y + y0 | Θ)/n. (1) (x,y)∈bin(i,j) Here rc (i, j) is the pooled response in the (i, j)-th bin for the c-th category, zi,j,c is one score map out of the k 2 (C + 1) score maps, (x0 , y0 ) denotes the top-left corner of an RoI, n is the number of pixels in the bin, and Θ denotes all learnable parameters of the network. The (i, j)-th bin spans iw w h h k ≤ x < (i + 1) k and j k ≤ y < (j + 1) k . The operation of Eqn.(1) is illustrated in 3

4.Figure 1, where a color represents a pair of (i, j). Eqn.(1) performs average pooling (as we use throughout this paper), but max pooling can be conducted as well. The k 2 position-sensitive scores then vote on the RoI. In this paper we simply vote by averaging the scores, producing a (C + 1)-dimensional vector for each RoI: rc (Θ) = i,j rc (i, j | Θ). Then we C compute the softmax responses across categories: sc (Θ) = erc (Θ) / c =0 erc (Θ) . They are used for evaluating the cross-entropy loss during training and for ranking the RoIs during inference. We further address bounding box regression [7, 6] in a similar way. Aside from the above k 2 (C +1)-d convolutional layer, we append a sibling 4k 2 -d convolutional layer for bounding box regression. The position-sensitive RoI pooling is performed on this bank of 4k 2 maps, producing a 4k 2 -d vector for each RoI. Then it is aggregated into a 4-d vector by average voting. This 4-d vector parameterizes a bounding box as t = (tx , ty , tw , th ) following the parameterization in [6]. We note that we perform class-agnostic bounding box regression for simplicity, but the class-specific counterpart (i.e., with a 4k 2 C-d output layer) is applicable. The concept of position-sensitive score maps is partially inspired by [3] that develops FCNs for instance-level semantic segmentation. We further introduce the position-sensitive RoI pooling layer that shepherds learning of the score maps for object detection. There is no learnable layer after the RoI layer, enabling nearly cost-free region-wise computation and speeding up both training and inference. Training. With pre-computed region proposals, it is easy to end-to-end train the R-FCN architecture. Following [6], our loss function defined on each RoI is the summation of the cross-entropy loss and the box regression loss: L(s, tx,y,w,h ) = Lcls (sc∗ ) + λ[c∗ > 0]Lreg (t, t∗ ). Here c∗ is the RoI’s ground-truth label (c∗ = 0 means background). Lcls (sc∗ ) = − log(sc∗ ) is the cross-entropy loss for classification, Lreg is the bounding box regression loss as defined in [6], and t∗ represents the ground truth box. [c∗ > 0] is an indicator which equals to 1 if the argument is true and 0 otherwise. We set the balance weight λ = 1 as in [6]. We define positive examples as the RoIs that have intersection-over-union (IoU) overlap with a ground-truth box of at least 0.5, and negative otherwise. It is easy for our method to adopt online hard example mining (OHEM) [22] during training. Our negligible per-RoI computation enables nearly cost-free example mining. Assuming N proposals per image, in the forward pass, we evaluate the loss of all N proposals. Then we sort all RoIs (positive and negative) by loss and select B RoIs that have the highest loss. Backpropagation [11] is performed based on the selected examples. Because our per-RoI computation is negligible, the forward time is nearly not affected by N , in contrast to OHEM Fast R-CNN in [22] that may double training time. We provide comprehensive timing statistics in Table 3 in the next section. We use a weight decay of 0.0005 and a momentum of 0.9. By default we use single-scale training: images are resized such that the scale (shorter side of image) is 600 pixels [6, 18]. Each GPU holds 1 image and selects B = 128 RoIs for backprop. We train the model with 8 GPUs (so the effective mini-batch size is 8×). We fine-tune R-FCN using a learning rate of 0.001 for 20k mini-batches and 0.0001 for 10k mini-batches on VOC. To have R-FCN share features with RPN (Figure 2), we adopt the 4-step alternating training3 in [18], alternating between training RPN and training R-FCN. Inference. As illustrated in Figure 2, the feature maps shared between RPN and R-FCN are computed (on an image with a single scale of 600). Then the RPN part proposes RoIs, on which the R-FCN part evaluates category-wise scores and regresses bounding boxes. During inference we evaluate 300 RoIs as in [18] for fair comparisons. The results are post-processed by non-maximum suppression (NMS) using a threshold of 0.3 IoU [7], as standard practice. À trous and stride. Our fully convolutional architecture enjoys the benefits of the network modi- fications that are widely used by FCNs for semantic segmentation [15, 2]. Particularly, we reduce ResNet-101’s effective stride from 32 pixels to 16 pixels, increasing the score map resolution. All layers before and on the conv4 stage [9] (stride=16) are unchanged; the stride=2 operations in the first conv5 block is modified to have stride=1, and all convolutional filters on the conv5 stage are modified by the “hole algorithm” [15, 2] (“Algorithme à trous” [16]) to compensate for the reduced stride. For fair comparisons, the RPN is computed on top of the conv4 stage (that are shared with 3 Although joint training [18] is applicable, it is not straightforward to perform example mining jointly. 4

5. vote yes image and RoI position-sensitive RoI-pool position-sensitive score maps Figure 3: Visualization of R-FCN (k × k = 3 × 3) for the person category. vote no image and RoI position-sensitive RoI-pool position-sensitive score maps Figure 4: Visualization when an RoI does not correctly overlap the object. R-FCN), as is the case in [9] with Faster R-CNN, so the RPN is not affected by the à trous trick. The following table shows the ablation results of R-FCN (k × k = 7 × 7, no hard example mining). The à trous trick improves mAP by 2.6 points. R-FCN with ResNet-101 on: conv4, stride=16 conv5, stride=32 conv5, à trous, stride=16 mAP (%) on VOC 07 test 72.5 74.0 76.6 Visualization. In Figure 3 and 4 we visualize the position-sensitive score maps learned by R-FCN when k × k = 3 × 3. These specialized maps are expected to be strongly activated at a specific relative position of an object. For example, the “top-center-sensitive” score map exhibits high scores roughly near the top-center position of an object. If a candidate box precisely overlaps with a true object (Figure 3), most of the k 2 bins in the RoI are strongly activated, and their voting leads to a high score. On the contrary, if a candidate box does not correctly overlaps with a true object (Figure 4), some of the k 2 bins in the RoI are not activated, and the voting score is low. 3 Related Work R-CNN [7] has demonstrated the effectiveness of using region proposals [27, 28] with deep networks. R-CNN evaluates convolutional networks on cropped and warped regions, and computation is not shared among regions (Table 1). SPPnet [8], Fast R-CNN [6], and Faster R-CNN [18] are “semi- convolutional”, in which a convolutional subnetwork performs shared computation on the entire image and another subnetwork evaluates individual regions. There have been object detectors that can be thought of as “fully convolutional” models. OverFeat [21] detects objects by sliding multi-scale windows on the shared convolutional feature maps; similarly, in 5

6.Table 2: Comparisons among fully convolutional (or “almost” fully convolutional) strategies using ResNet-101. All competitors in this table use the à trous trick. Hard example mining is not conducted. method RoI output size (k × k) mAP on VOC 07 (%) naïve Faster R-CNN 1×1 61.7 7×7 68.9 class-specific RPN - 67.6 R-FCN (w/o position-sensitivity) 1×1 fail R-FCN 3×3 75.5 7×7 76.6 Fast R-CNN [6] and [12], sliding windows that replace region proposals are investigated. In these cases, one can recast a sliding window of a single scale as a single convolutional layer. The RPN component in Faster R-CNN [18] is a fully convolutional detector that predicts bounding boxes with respect to reference boxes (anchors) of multiple sizes. The original RPN is class-agnostic in [18], but its class-specific counterpart is applicable (see also [14]) as we evaluate in the following. Another family of object detectors resort to fully-connected (fc) layers for generating holistic object detection results on an entire image, such as [25, 4, 17]. 4 Experiments 4.1 Experiments on PASCAL VOC We perform experiments on PASCAL VOC [5] that has 20 object categories. We train the models on the union set of VOC 2007 trainval and VOC 2012 trainval (“07+12”) following [6], and evaluate on VOC 2007 test set. Object detection accuracy is measured by mean Average Precision (mAP). Comparisons with Other Fully Convolutional Strategies Though fully convolutional detectors are available, experiments show that it is nontrivial for them to achieve good accuracy. We investigate the following fully convolutional strategies (or “almost” fully convolutional strategies that have only one classifier fc layer per RoI), using ResNet-101: Naïve Faster R-CNN. As discussed in the introduction, one may use all convolutional layers in ResNet-101 to compute the shared feature maps, and adopt RoI pooling after the last convolutional layer (after conv5). An inexpensive 21-class fc layer is evaluated on each RoI (so this variant is “almost” fully convolutional). The à trous trick is used for fair comparisons. Class-specific RPN. This RPN is trained following [18], except that the 2-class (object or not) convolutional classifier layer is replaced with a 21-class convolutional classifier layer. For fair comparisons, for this class-specific RPN we use ResNet-101’s conv5 layers with the à trous trick. R-FCN without position-sensitivity. By setting k = 1 we remove the position-sensitivity of the R-FCN. This is equivalent to global pooling within each RoI. Analysis. Table 2 shows the results. We note that the standard (not naïve) Faster R-CNN in the ResNet paper [9] achieves 76.4% mAP with ResNet-101 (see also Table 3), which inserts the RoI pooling layer between conv4 and conv5 [9]. As a comparison, the naïve Faster R-CNN (that applies RoI pooling after conv5) has a drastically lower mAP of 68.9% (Table 2). This comparison empirically justifies the importance of respecting spatial information by inserting RoI pooling between layers for the Faster R-CNN system. Similar observations are reported in [19]. The class-specific RPN has an mAP of 67.6% (Table 2), about 9 points lower than the standard Faster R-CNN’s 76.4%. This comparison is in line with the observations in [6, 12] — in fact, the class-specific RPN is similar to a special form of Fast R-CNN [6] that uses dense sliding windows as proposals, which shows inferior results as reported in [6, 12]. On the other hand, our R-FCN system has significantly better accuracy (Table 2). Its mAP (76.6%) is on par with the standard Faster R-CNN’s (76.4%, Table 3). These results indicate that our position- sensitive strategy manages to encode useful spatial information for locating objects, without using any learnable layer after RoI pooling. 6

7.Table 3: Comparisons between Faster R-CNN and R-FCN using ResNet-101. Timing is evaluated on a single Nvidia K40 GPU. With OHEM, N RoIs per image are computed in the forward pass, and 128 samples are selected for backpropagation. 300 RoIs are used for testing following [18]. depth of per-RoI training train time test time subnetwork w/ OHEM? (sec/img) (sec/img) mAP (%) on VOC07 Faster R-CNN 10 1.2 0.42 76.4 R-FCN 0 0.45 0.17 76.6 Faster R-CNN 10 (300 RoIs) 1.5 0.42 79.3 R-FCN 0 (300 RoIs) 0.45 0.17 79.5 Faster R-CNN 10 (2000 RoIs) 2.9 0.42 N/A R-FCN 0 (2000 RoIs) 0.46 0.17 79.3 Table 4: Comparisons on PASCAL VOC 2007 test set using ResNet-101. “Faster R-CNN +++” [9] uses iterative box regression, context, and multi-scale testing. training data mAP (%) test time (sec/img) Faster R-CNN [9] 07+12 76.4 0.42 Faster R-CNN +++ [9] 07+12+COCO 85.6 3.36 R-FCN 07+12 79.5 0.17 R-FCN multi-sc train 07+12 80.5 0.17 R-FCN multi-sc train 07+12+COCO 83.6 0.17 Table 5: Comparisons on PASCAL VOC 2012 test set using ResNet-101. “07++12” [6] denotes the union set of 07 trainval+test and 12 trainval. † : http://host.robots.ox.ac.uk:8080/anonymous/44L5HI.html ‡ : http://host.robots.ox.ac.uk:8080/anonymous/MVCM2L.html training data mAP (%) test time (sec/img) Faster R-CNN [9] 07++12 73.8 0.42 Faster R-CNN +++ [9] 07++12+COCO 83.8 3.36 R-FCN multi-sc train 07++12 77.6† 0.17 R-FCN multi-sc train 07++12+COCO 82.0‡ 0.17 The importance of position-sensitivity is further demonstrated by setting k = 1, for which R-FCN is unable to converge. In this degraded case, no spatial information can be explicitly captured within an RoI. Moreover, we report that naïve Faster R-CNN is able to converge if its RoI pooling output resolution is 1 × 1, but the mAP further drops by a large margin to 61.7% (Table 2). Comparisons with Faster R-CNN Using ResNet-101 Next we compare with standard “Faster R-CNN + ResNet-101” [9] which is the strongest competitor and the top-performer on the PASCAL VOC, MS COCO, and ImageNet benchmarks. We use k × k = 7 × 7 in the following. Table 3 shows the comparisons. Faster R-CNN evaluates a 10-layer subnetwork for each region to achieve good accuracy, but R-FCN has negligible per-region cost. With 300 RoIs at test time, Faster R-CNN takes 0.42s per image, 2.5× slower than our R-FCN that takes 0.17s per image (on a K40 GPU; this number is 0.11s on a Titan X GPU). R-FCN also trains faster than Faster R-CNN. Moreover, hard example mining [22] adds no cost to R-FCN training (Table 3). It is feasible to train R-FCN when mining from 2000 RoIs, in which case Faster R-CNN is 6× slower (2.9s vs. 0.46s). But experiments show that mining from a larger set of candidates (e.g., 2000) has no benefit (Table 3). So we use 300 RoIs for both training and inference in other parts of this paper. Table 4 shows more comparisons. Following the multi-scale training in [8], we resize the image in each training iteration such that the scale is randomly sampled from {400,500,600,700,800} pixels. We still test a single scale of 600 pixels, so add no test-time cost. The mAP is 80.5%. In addition, we train our model on the MS COCO [13] trainval set and then fine-tune it on the PASCAL VOC set. R-FCN achieves 83.6% mAP (Table 4), close to the “Faster R-CNN +++” system in [9] that uses ResNet-101 as well. We note that our competitive result is obtained at a test speed of 0.17 seconds per image, 20× faster than Faster R-CNN +++ that takes 3.36 seconds as it further incorporates iterative box regression, context, and multi-scale testing [9]. These comparisons are also observed on the PASCAL VOC 2012 test set (Table 5). 7

8.On the Impact of Depth The following table shows the R-FCN results using ResNets of different depth [9]. Our detection accuracy increases when the depth is increased from 50 to 101, but gets saturated with a depth of 152. training data test data ResNet-50 ResNet-101 ResNet-152 R-FCN 07+12 07 77.0 79.5 79.6 R-FCN multi-sc train 07+12 07 78.7 80.5 80.4 On the Impact of Region Proposals R-FCN can be easily applied with other region proposal methods, such as Selective Search (SS) [27] and Edge Boxes (EB) [28]. The following table shows the results (using ResNet-101) with different proposals. R-FCN performs competitively using SS or EB, showing the generality of our method. training data test data RPN [18] SS [27] EB [28] R-FCN 07+12 07 79.5 77.2 77.8 4.2 Experiments on MS COCO Next we evaluate on the MS COCO dataset [13] that has 80 object categories. Our experiments involve the 80k train set, 40k val set, and 20k test-dev set. We set the learning rate as 0.001 for 90k iterations and 0.0001 for next 30k iterations, with an effective mini-batch size of 8. We extend the alternating training [18] from 4-step to 5-step (i.e., stopping after one more RPN training step), which slightly improves accuracy on this dataset when the features are shared; we also report that 2-step training is sufficient to achieve comparably good accuracy but the features are not shared. The results are in Table 6. Our single-scale trained R-FCN baseline has a val result of 48.9%/27.6%. This is comparable to the Faster R-CNN baseline (48.4%/27.2%), but ours is 2.5× faster testing. It is noteworthy that our method performs better on objects of small sizes (defined by [13]). Our multi-scale trained (yet single-scale tested) R-FCN has a result of 49.1%/27.8% on the val set and 51.5%/29.2% on the test-dev set. Considering COCO’s wide range of object scales, we further evaluate a multi-scale testing variant following [9], and use testing scales of {200,400,600,800,1000}. The mAP is 53.2%/31.5%. This result is close to the 1st-place result (Faster R-CNN +++ with ResNet-101, 55.7%/34.9%) in the MS COCO 2015 competition. Nevertheless, our method is simpler and adds no bells and whistles such as context or iterative box regression that were used by [9], and is faster for both training and testing. Table 6: Comparisons on MS COCO dataset using ResNet-101. The COCO-style AP is evaluated @ IoU ∈ [0.5, 0.95]. AP@0.5 is the PASCAL-style AP evaluated @ IoU = 0.5. training test AP@0.5 AP AP AP AP test time data data small medium large (sec/img) Faster R-CNN [9] train val 48.4 27.2 6.6 28.6 45.0 0.42 R-FCN train val 48.9 27.6 8.9 30.5 42.0 0.17 R-FCN multi-sc train train val 49.1 27.8 8.8 30.8 42.2 0.17 Faster R-CNN +++ [9] trainval test-dev 55.7 34.9 15.6 38.7 50.9 3.36 R-FCN trainval test-dev 51.5 29.2 10.3 32.4 43.3 0.17 R-FCN multi-sc train trainval test-dev 51.9 29.9 10.8 32.8 45.0 0.17 R-FCN multi-sc train, test trainval test-dev 53.2 31.5 14.3 35.5 44.2 1.00 5 Conclusion and Future Work We presented Region-based Fully Convolutional Networks, a simple but accurate and efficient framework for object detection. Our system naturally adopts the state-of-the-art image classification backbones, such as ResNets, that are by design fully convolutional. Our method achieves accuracy competitive with the Faster R-CNN counterpart, but is much faster during both training and inference. We intentionally keep the R-FCN system presented in the paper simple. There have been a series of orthogonal extensions of FCNs that were developed for semantic segmentation (e.g., see [2]), as well as extensions of region-based methods for object detection (e.g., see [9, 1, 22]). We expect our system will easily enjoy the benefits of the progress in the field. 8

9. person : 0 0.982 person : 0.982 person : 1.000 car : 0.998 person : 0.997 person : 0.993 ƉĞƌƐŽŶ͗Ϭ͘ϵϵϱ ϵϱ ƉĞƌƐŽŶ͗Ϭ͘ϵϴϱ ƉĞ person : 0.993 person : 0.992 car : 0.999 person : 0.998 person : 0.964 Ŷperson ͗ Ϭ͘ϵϱϬ: 0.908 ƉĞƌƐŽŶ͗Ϭ͘ϵϱϬ person person : 0.994 person : 0.996 p pe ƉĞƌƐŽŶ͗Ϭ͘ϵϰϱ car : 0.994 ďŝĐLJĐůĞ͗Ϭ͘ϳϭϵ dog : 0.992 car : 0.999 person : 0.987 person : 0.998 dog : 0.960 car : 0.997 ďŝĐLJĐůĞ͗Ϭ͘ϵϴϭ car : 0.881 ďŝĐLJĐůĞ͗Ϭ͘ϵϰϴ ďŝĐLJĐůĞ͗Ϭ͘ϵϬϳ horse : 0.999 ďŝĐLJĐůĞ͗Ϭ͘ϵϲϳ ďŝĐLJĐůĞ͗Ϭ͘ϵϲϭ ĐĂƌ͗Ϭ͘ϵϵϱ person : 0.997 person : 1.000 person : 0.970 ďŝƌĚ͗Ϭ͘ϵϱϭ Ϭ͘ϵϱϭ bird : 0.887 bird : 0.939 sheep : 0.801 sheep : sheep 0.942 : 0.898 bird : 0.986 sheep : 0.848 person : 0.981 person : 0.891 bird : 0.721 person : 0.996 person : 0.819 bird : 0.946 bird : 0.929 bird : 0.947 .9 .947 947 person : 0.707 bird : 0.897 bird : 0.966 b bird : 0.936 ďŝƌĚ͗Ϭ͘ϵϱϭ ďŝƌĚ͗Ϭ͘ϵϱϰ bird : 0.620 person : 0.847 person : 0.978 ďŝƌĚ͗Ϭ͘ϴϱϱ bi d : 0.906 bird person : 0.999 bird : 0.920 bird : 0.973 ďŝƌĚ͗Ϭ͘ϳϵϱ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϵϲϳ person : 0.990 sheep : 0.997 bird : 0.998 sheep : 0.999 ďŽƩůĞ͗Ϭ͘ϵϵϱ ďŽƩůĞ͗Ϭ͘ϵϵϱ person : 0.933 ƉĞƌƐŽŶ͗Ϭ͘ϴϱϭ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϵϳϳ bird : 0.994 person : 0.980 ƉĞƌƐŽŶ͗Ϭ͘ϴϱϴ person : 0.988 tvmonitor : 0.999 person : 0.968 tvmonitor : 0.989 ƉĞƌƐŽŶ͗Ϭ͘ϵϱϵ ƉĞƌƐŽŶ͗Ϭ͘ϵϵϱ person : 0.992 bird : 0.872 bird : 0 0.990 person : 0.997 bird : 0.999 bird : 0.886 person : 0.930 person : 0.987 person : 0.999 9 999 person : 0.991 person : 00.993 person : 0.993 93 person : 0.979 horse : 0.981 horse : 0.999 horse : 0.893 person : 0.908 ďŽƩůĞ͗Ϭ͘ϵϵϲ bird : 0.988 bird : 0.624 ďŝĐLJĐůĞ͗Ϭ͘ϵϱϭ ďŝĐLJĐůĞ͗Ϭ͘ϵϱϮ ďŝĐLJĐůĞ͗Ϭ͘ϵϬϲ horse : 0.808 ď ďŽƩůĞ͗Ϭ͘ϵϵϳ bird : 0.989 horse : 0.977 bird : 0.999 ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϵϬϱ ďŝĐLJĐůĞ͗Ϭ͘ϵϯϵ person : 0.963 ďŽƩůĞ͗Ϭ͘ϵϴϴ ďŽƩůĞ͗ ďŽƩůĞ͗Ϭ Ϭď ďŽƩůĞ͗Ϭ͘ϵϵϯ ďŽƩůĞ ďŽƩůĞ ƩůĞ ďŽƩůĞ ďŽƩůĞ ďŽƩůĞ͗Ϭ͘ϵϵϲ ď ůĞ ϵϵϲ ϵϲ ďŽƩůĞ͗Ϭ͘ϵϵϳ ďŽƩůĞ ďŽƩůĞ ďŽƩůĞ͗Ϭ͘ϵϵϭ ϵďϵϳ ŽƩůĞ ŽƩůĞ͗Ϭ͘ϵϲϴ ƩůĞĞĞ͗Ϭ͘ϵϲϴ person ͗ Ϭ͘ϵ :ϵϲϴ 0.999 ďŽƩůĞ͗Ϭ͘ϵϯϴ ďŽƩůĞ͗ ďŽƩůĞ͗Ϭ͘ϵϯ Ϭ ϵďŽƩůĞ ďŽƩůĞ͗Ϭ͘ϵϳϰ ď ďŽƩůĞ͗Ϭ͘ϵϴϮ Ğ ͗Ϭ ϴϮƩůĞĞ͗Ϭ͘ϵ ďŽƩůĞ͗Ϭ͘ϵϰϴ Ğ Ϭ ϵ ďŽƩůĞ͗Ϭ͘ϲϳϵ motorbike : 0.999 ŵŽƚŽƌďŝŬĞ͗Ϭ͘ϵϵϱ ŵŽƚŽƌďŝŬĞ͗Ϭ͘ϵϵϱ ďŽƩůĞ͗Ϭ͘ϵϴϮ ďŽƩůĞ͗ ďŽƩůĞ Ž ͗ Ϭ͘ϵϴ ϵϴϮ ďŽƩůĞ͗Ϭ͘ϵϲϯ ďŽƩůĞ ͗Ϭ͘ϵ ďŽƩůĞ͗Ϭ͘ϴϬϴϬƩϵ ďŽƩ ϵϲϯ ϲďŽƩůĞ͗Ϭ͘ϵϰϳ ϲϯ Ğϯ͗Ϭ ďŽƩůĞ͗Ϭ͘ϵϱϭ ďϬ͘ Ϭ͘ϵ ϵϱ ϵϱ ϱϭ ͗ ďŽƩůĞ͗Ϭ͘ϵϮϴ ϭϬ ϵ ϵϰϳ ϳď ƩůĞď ďŽƩůĞ͗Ϭ͘ϵϳϴ ďŽƩ ŽƩ ͗ϬƩů ϴ ďŽƩůĞ͗ ďŽƩůĞ͗Ϭ͘ϵϳϵ ď ďŽƩůĞ͗Ϭ ƩůĞ ͗ϯďŽƩůĞ͗Ϭ͘ϵϯϮ ƩůĞ͗ ƩůĞ ďŽƩůĞ͗Ϭ͘ϵϯϯ ŽƩů Ž ŽƩůĞ Ϭ ϵϳϴ ďŽƩůĞ͗Ϭ͘ϵϱϳ person : 0.982 ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϲϲϭ person : 0.996 person : 0.999 ĐĂƌ͗Ϭ͘ϳϱϬ car : 0.992 ƉĞƌƐŽŶ͗Ϭ͘ϵϴϱ ĐĂƌ͗Ϭ͘ϵϵϱ ďŽƩůĞ͗Ϭ͘ϲϮϴ ďŽƩůĞ͗Ϭ͘ϳϭϱ ƩůĞĞďŽƩ ďŽƩůĞ͗Ϭ͘ϵϱϬ ď ŽƩƩ ƩůĞ ďŽƩůĞ͗Ϭ͘ϴϳϰ ďŽƩůĞ ďŽƩůĞ ͗Ϭ͘ ďŽƩůĞ͗Ϭ͘ϵϭϭ ůĞ ͗ Ϭ ϵϱϬ ůĞ͗ ůĞ ϵϱ ͘ϵϭϭ ϭ ďŽƩůĞ͗Ϭ͘ϵϴϯ ď Ž ďŽƩůĞ͗Ϭ͘ϵϮϱ ďŽƩůĞ͗Ϭ͘ϵϴϯ Ϭ͘ϵ Ϭ͘ ϴϯ ďŽƩůĞ͗Ϭ͘ϵϱϯ ďŽƩůĞ͗Ϭ͘ϳϮϴ 0car car : 0.997 997: 0.971 car : 0.993 ca car : 0.968 boat : 0.892 chair : 0.843 person : 0.996 car : 0.997 dog : 0.983 ďŽĂƚ͗Ϭ͘ϲϰϱ person : 0.999 person : 0.906 ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϵϴϲ boat : 0.918 person : 0.986 chair : 0.992 bus : 1.000 person : 0.999 bus : 1.000 ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϵϰϴ chair : 0.976 bus : 0.992 person : 0.997 ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϵϱϯ ƉŽƩĞĚƉůĂŶƚ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϴϰϳ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϲϳϭ person : 0.991 ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϵϲϯ car : 0.961 person : 1.000 dog : 0.996 dog : 0.999 person : 0.947 ƉĞƌƐŽŶ͗Ϭ͘ϵϯϱ train : 0.998 person : 0.990 horse : 0.996 person : 0.994 person : 0.984 train : 0.998 ŚŽƌƐĞ͗Ϭ͘ϵϵϱ horse : 0.997 person : 0.997 ƉĞƌƐŽŶ͗Ϭ͘ϵϲϱ sheep : 0.998 sheep : 0.998 person : 0.910 dog : 0.993 person : 0.936 6 person : 0.983 ƉĞƌƐŽŶ͗Ϭ͘ϵϵϱ ƉĞƌƐŽŶ person : 0.822 sheep : 0.714 ƉĞƌƐŽŶ͗Ϭ͘ϲϴϱ chair : 1.000 sheep : 0.997 chair : 0.999 person : 0.998 sofa : 0.772 person : 0.977 car : 0.969 train : 0.984 dog : 0.979 train : 0.996 person : 0.999 train : 0.999 ĂĞƌŽƉůĂŶĞ͗Ϭ͘ϵϳϵ dog : 0.991 ĂĞƌŽƉůĂŶĞ͗Ϭ͘ϵϱϵ person : 0.994 person : 0.998 person : 0.999 3 car : 0.629 car : 0.933 car : 0.930 car : 0.962 cow : 0.991 cow : 0.949 car : 0.967 cow : 0.942 person : 0.994 cow : 0.630 chair : 0.844 ĐŚĂŝƌ͗Ϭ͘ϵϰϱ ĐŚĂ ƌ ƌ chair : 0.992 car : 0.980 person : 0.926 cow : 0.999 car : 0.987 chair : 0.996 car : 0.989 car : 0.983 car : 0.994 94 chair : 0.993 person : 0.971 car : 0.987 ĐŽǁ͗Ϭ͘ϵϵϱ chair : 0.996 car : 0.999 dog : 0.896 chair : 0.990 car : 0.972 chair : 1.000 car : 1.000 Figure 5: Curated examples of R-FCN results on the PASCAL VOC 2007 test set (83.6% mAP). The network is ResNet-101, and the training data is 07+12+COCO. A score threshold of 0.6 is used for displaying. The running time per image is 170ms on one Nvidia K40 GPU. References [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016. [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. [3] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. arXiv:1603.08678, 2016. [4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 9

10. ƐƵƌĩŽĂƌĚ͗Ϭ͘ϵϭϭ person erson : 0.9 person 0.966 6 : 0.825 0person 825 : ƐƵƌĩŽĂƌĚ͗Ϭ͘ϳϭϴ 0 0.768 768 ĐůŽĐŬ͗Ϭ͘ϵϲϯ ƉĞƌƐŽŶ͗Ϭ͘ϵϲϭ ƉĞƌƐŽŶ͗Ϭ͘ϵϭϯ person : 0.990 ƐŚĞĞƉ͗Ϭ͘ϵϯϴ ƉĞƌƐŽŶ͗Ϭ͘ϵϰϭ person : 0.844 person : 0.999 9sheep : 0.949 sheep : 0.989 0 949 sheep : 0.986 sh ssheep : 0.992 person : 0 pe 0.984 984 sheepshee sheep : 0.977 : 0.964 0.9 person : 0.958 ĐĂƌ͗Ϭ͘ϳϮϵ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϴϵϬ ďŽƩůĞ͗Ϭ͘ϴϬϭ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϴϱϴ ƉĞƌƐŽŶ͗Ϭ͘ϵϲϯ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϳϬϯ ďŽǁů͗Ϭ͘ϳϯϮ ĐŚĂŝƌ͗Ϭ͘ϳϰϭ ƐƉŽŽŶ͗Ϭ͘ϲϭϮ Ϭů͘ϲϭϮ ďŽǁů͗Ϭ͘ϵϯϳ ď ƉĞƌƐŽŶ͗Ϭ͘ϴϯϭ person : 0.652 ƉĞƌƐŽŶ͗Ϭ͘ϵϭϲ person n : 0.894 ďŽǁů͗Ϭ͘ϵϴϯ ďŽǁů͗Ϭ͘ϲϵϰ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϴϵϯ ŵŽƚŽƌĐLJĐů ŵ Ğ͗ Ğ cup cuup : 0 0.784 784 ĐŚĂŝƌ͗Ϭ͘ϳϬϬ ďŽƩůĞ͗Ϭ͘ϴϯϵ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϴϯϴ ĐĐĂƌ͗Ϭ͘ϴϱϰ ĐĂƌ͗Ϭ͘ϴϱ ĐĂƌ ͗ Ϭ ϴϱ ĐŚĂŝƌ͗Ϭ͘ϳϭϳ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϴϳϭ cup : 0.849 ĚŽŐ͗Ϭ͘ϲϴϰ ƐƵƌĩŽĂƌĚ͗Ϭ͘ϵϴϲ ĐĂƚ͗Ϭ͘ϵϴϵ ďŽǁů͗Ϭ͘ϵϬϳcu cup : 0.680 ƉĞƌƐŽŶ͗Ϭ͘ϵϭϵ ďŽƩůĞ͗Ϭ͘ϵϯϰ ĐŽǁ͗Ϭ͘ϴϮϯ ƉĞƌƐŽŶ͗Ϭ͘ϵϭϰ person : 0.952 ďŝĐLJĐůĞ͗Ϭ͘ϴϵϱ ĐŚĂŝƌ͗Ϭ͘ϳϰϳ ĐĂƌ͗Ϭ͘ϵϮϯ person : 0.958 per ƚƌƵĐŬ͗Ϭ͘ϳϱϯ ĐĂƌ͗Ϭ͘ϴϴϰ ĐĂ ƌ ƉĞƌƐŽŶ͗Ϭ͘ϵϵϭ ƉĞƌƐŽŶ͗Ϭ͘ϵϲϯ ĐĂƌ͗Ϭ͘ϵϮϮ ĐĂƌ͗Ϭ͘ϵϯϱ ƉĞƌƐŽŶ͗Ϭ͘ϵϵϭ Ϭ͘ϵϵϭ ĐĞůůƉŚŽŶĞ͗Ϭ͘ϵϱϲ ĐŚĂŝƌ͗Ϭ͘ϵϰϮ ĐĂƌ͗Ϭ͘ϲϵϳ ϴϯ ĐĂƌ ͗ Ϭ ĐĂƌ͗Ϭ͘ϵϳϱ ĐĂƌ͗Ϭ͘ϵϴϭ ĐĂƌ ĐĂƌ͗Ϭ ϭ person pers : 0.84 0.845 5 nƉĞƌƐŽŶ͗Ϭ͘ϴϯϴ person :0 Ŷ͗Ϭ Ŷ 87͗ Ϭ ϴϯ 0.987 987 ϴϯϴ ϴĂƌϰ͗͗ ĐĂƌ͗ ĐĂƌ͗Ϭ͘ϴϴϰ Ϭ͘ĐĂ Ϭ ĐĂ ĐĂƌ͗Ϭ͘ϵϳϲ ĐĐĂƌ͗Ϭ͘ϵϴϯ ͘ϵϴϯ ͘ϵϴ ϵϴ ĐĂƌ͗Ϭ͘ϵϳϵ ĐĂƌ͗Ϭ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϵϰϮ Ϭ ǁŝŶĞŐůĂƐƐ͗Ϭ͘ϵϰϰ ǁŝŶĞŐůĂƐƐ ǁŝŶĞŐůĂƐƐ͗͗Ϭ͘ϵ ĐĞůůƉŚŽŶĞ͗Ϭ͘ϴϳϰ ǁŝŶĞŐůĂƐƐ͗Ϭ͘ϲϮϰ Ő ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϲϵϬ person o onn : 0. 0 person 0.726 726 ϳϭϳ: 0.706 7Ŷ͗Ϭ͘ϳ ƉĞƌƐŽŶ͗Ϭ͘ϳϭϳ ĐŚĂŝƌ͗Ϭ͘ϳϰϴ ĐƵƉ͗Ϭ͘ϴϴϯ ϯ person n:0 0.726 .726 person p ersonƉ : 0.642 0 642 ƉĞƌƐŽŶ͗Ϭ͘ϳϵϭ ƉĞƌƐŽŶ ĞƌƐŽ Ŷ ͗Ϭ͘ ͗͗Ϭ͘ϳϵϭ Ϭ͘ ĐƵƉ͗Ϭ͘ϴϭϳ ĚŽŐ͗Ϭ͘ϵϳϳ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϲϭϭ person p e ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϲϭϰ onŶŝŶ :0 0.620 620 622 20 0ůĞ ͗ Ϭ ϲ ď ďů ϲϭϰ ϭϰ ƉĞƌƐŽŶ͗Ϭ͘ϳϳϭ person pe rson on : 00.768 768 7 68 68 Ϭ ϲϭϯ person : 0 0.858 ĐŚĂŝƌ͗Ϭ͘ϲϲϮ ƉĞƌƐŽŶ͗Ϭ͘ϲϭϯ ƌ ϭŽŶ ƉĞƌƐŽŶ͗Ϭ͘ϴϳϯ ƉĞƌƐŽŶ͗ ƉĞ cup : 0.726 cup : 0.868 ĐƵƉ͗Ϭ͘ϳϭϴ ĨŽƌŬ͗Ϭ͘ϳϯϲ ĐŚĂŝƌ͗Ϭ͘ϴϲϱ ĐŚĂŝƌ͗Ϭ͘ϲϲϲ ĐŚĂŝƌ͗Ϭ͘ϵϳϵ ĐĞůůƉŚŽŶĞ͗Ϭ͘ϵϱϯ ĚŽŶƵƚ͗Ϭ͘ϵϭϴ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϳϴϭ ĚŽŶƵƚ͗Ϭ͘ϴϭϳ person : 0.975 person : 0.982 p person :0 ĐŚĂŝƌ͗Ϭ͘ϲϬϵ 0.959 ƉĞƌƐŽŶ͗Ϭ͘ϵϮϯ ƉĞƌƐŽŶ͗Ϭ͘ϵϱϯ ĐůŽĐŬ͗Ϭ͘ϲϵϵ horse : 0.944 ĚŽŶƵƚ͗Ϭ͘ϵϭϬ ĚŽŶƵƚ͗Ϭ͘ϲϱϭ ƵŵďƌĞůůĂ͗Ϭ͘ϴϲϳ horse : 0.955 ƵŵďƌĞůůĂ͗Ϭ͘ϴϲϵ horse : 0.860 ĚŽŶƵƚ͗Ϭ͘ϵϱϴ Ě ĚŽŶ horse : 0.667 ĚŽŶƵƚ͗Ϭ͘ϴϱϭ ĚŽŶƵƚ͗Ϭ͘ϳϰϱ ůĂƉƚŽƉ͗Ϭ͘ϵϴϯ horse : 0.979 ƉĞƌƐŽŶ͗Ϭ͘ϳϲϯ ĚŽŶƵƚ͗Ϭ͘ϵϴϳ ĚŽŶƵƚ͗Ϭ͘ϵϱϴ ĚŽŶƵƚ͗Ϭ͘ϲϴϮ ƵŵďƌĞůůĂ͗Ϭ͘ϵϳϯ person : 0.790 horse : 0.978 person : 0.978 ĚŽŶƵƚ͗Ϭ͘ϵϱϱ ŵŽƚŽƌĐLJĐůĞ͗Ϭ͘ϵϰϰ ĚŽŐ͗Ϭ͘ϳϳϬ ĚŽŶƵƚ͗Ϭ͘ϵϱϮ horse : 0.868 ĚŽŶƵƚ͗Ϭ͘ϵϲϱ person : 0.975 person : 0.947 ĚŽŶƵƚ͗Ϭ͘ϵϴϬ person : 0.775 ĚŽŶƵƚ͗Ϭ͘ϵϱϱ ĚŽŶƵƚ͗Ϭ͘ϲϬϱ ĚŽŶƵƚ͗Ϭ͘ϲϳϲ ŚĂŶĚďĂŐ͗Ϭ͘ϲϮϳ ĚŽŶƵƚ͗Ϭ͘ϴϵϴ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϵϴϵ ϵϴϵ ϵ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϳϭϭ person : 0.850 person : 0.650 person : 0.940 person ϴϯϰ : 0.829 ƉĞƌƐŽŶ͗Ϭ͘ϴϯϰ 0 987 : 0.757 person :person perso 0.987 0 757 ƉĞƌƐŽŶ͗Ϭ͘ϴϭϭ Ɖ ƉĞƌƐŽŶ͗Ϭ͘ϴϳϯ ƉĞƌƐŽŶ ĞƌƐŽ ŽŶ ϳϯ ƚǀ͗Ϭ͘ϵϴϰ ĐŚĂŝƌ͗Ϭ͘ϴϰϰ ĐŚĂŝƌ͗Ϭ͘ϲϱϲ ƉĞƌƐŽŶ͗Ϭ͘ϳϯϰ Ɖ ĞƌƐŽŶ ͗ Ϭ ĞƌƐŽŶ͗Ϭ person : 0.964 cup : 0.964 ďŽƩůĞ͗Ϭ͘ϴϭϬ ďŽǁů͗Ϭ͘ϵϰϰ ŚĂŶĚďĂŐ͗Ϭ͘ϳϰϵ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϲϱϬ ŬŝƚĞ͗Ϭ͘ϵϲϭ mouse ouse : 0 ouse 0.699 699 9 ŬĞLJďŽĂƌĚ͗Ϭ͘ϴϴϲ ŬŝƚĞ͗Ϭ͘ϵϲϰ ŬŝƚĞ͗Ϭ͘ϵϬϭ ĨŽƌŬ͗Ϭ͘ϲϱϯ ŬŝƚĞ͗Ϭ͘ϴϵϮ ƐƉŽƌƚƐďĂůů͗Ϭ͘ϵϴϳ ŽƌĂŶŐĞ͗Ϭ͘ϳϯϴ ŬŝƚĞ͗Ϭ͘ϵϭϬ ƐĂŶĚǁŝĐŚ͗Ϭ͘ϲϴϳ ŬŝƚĞ͗Ϭ͘ϵϬϲ ŬŝƚĞ͗Ϭ͘ϴϭϮ person : 0.755 person : 0.990 ƚĞŶŶŝƐƌĂĐŬĞƚ͗Ϭ͘ϵϵϮ ůĂƉƚŽƉ͗Ϭ͘ϵϮϯ person : 0.994 person : 0.972 personƉĞƌƐŽŶ͗Ϭ͘ϵϯϴ : 0.9 pers person 8 son : 0. 0.988 0 0.974 person : 0.969 person : 0.970 Ŭ͗Ϭ͘ϴϴϵ person : 0.946 ƉĞƌƐŽŶ͗Ϭ͘ϵϬϭ person : 0.979 ƉĞƌƐŽŶ͗Ϭ͘ϵϵϭ ƟĞ͗Ϭ͘ϵϱϴ ĐŽƵĐŚ͗Ϭ͘ϲϯϲ ďƌŽĐĐŽůŝ͗Ϭ͘ϳϬϰ ďŽǁů͗Ϭ͘ϴϬϭ person : 0.969 ƉĞƌƐŽŶ͗Ϭ͘ϵϴϭ person : 0.920 ďƌŽĐĐŽůŝ͗Ϭ͘ϲϳϳ ĐĂŬĞ͗Ϭ͘ϵϴϯ ĐŚĂŝƌ͗Ϭ͘ϵϳϯ ĐĂƚ͗Ϭ͘ϴϳϱ ƉĞƌƐŽŶ͗Ϭ͘ϲϭϴ ƐŽŶ͗ ƐŽŶ͗Ϭ͘ϲϭϴ ĐĂƚ͗Ϭ͘ϵϵϬ ĐĂƚ ͗ Ϭ͘ϵϵ ŚĂŶĚďĂŐ͗Ϭ͘ϴϳϱ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϵϬϬ ďƌŽĐĐŽůŝ͗Ϭ͘ϳϵϳ ďƌŽĐĐŽůŝ͗Ϭ͘ϴϳϲ person : 0.877 ĐŚĂŝƌ͗Ϭ͘ϲϬϰ ůĂƉƚŽƉ͗Ϭ͘ϵϰϴ mouse : 0.968 person : 0.950 ŚĂŶĚďĂŐ͗Ϭ͘ϳϳϱ ŚĂŶĚďĂŐ͗Ϭ͘ϳϰϱ ĐŚĂŝƌ͗Ϭ͘ϳϴϰ ďŽƩůĞ͗Ϭ͘ϴϴϴ person : 0.890 ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϴϴϵ ďŽƩůĞ͗Ϭ͘ϵϲϲ ĐŚĂŝƌ͗Ϭ͘ϲϰϭ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϲϱϯ ďŽƩůĞ͗Ϭ͘ϵϲϭ ďƌŽĐĐŽůŝ͗Ϭ͘ϴϴϵ ďƌŽĐĐŽůŝ͗Ϭ͘ϴϵϳ bus : 0.799 ƵŵďƌĞůůĂ͗Ϭ͘ϴϳϵ Ɖ ƉĞƌƐŽŶ͗Ϭ͘ϵϯϱ person : 0.880 ĐŚĂŝƌ͗Ϭ͘ϴϱϵ person : 0.922 ĐŚĂŝƌ͗Ϭ͘ϳϲϮ person : 0.947 ĐŚĂŝƌ͗Ϭ͘ϴϲϳ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϲϭϱ ĐŚĂŝƌ͗Ϭ͘ϳϴϰ ŬŶŝĨĞ͗Ϭ͘ϳϮϬ ŽƌĂŶŐĞ͗Ϭ͘ϳϴϵ spoon : 0.749 ŽƌĂŶŐĞ͗Ϭ͘ϵϰϬ ďŽƩůĞ͗Ϭ͘ϵϳϰ ŽƌĂŶŐĞ͗Ϭ͘ϳϯϵ ĂƉƉůĞ͗Ϭ͘ϲϭϴ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϳϮϴ ďŽƩůĞ͗Ϭ͘ϴϲϰ ĐƵƉ͗Ϭ͘ϴϴϭ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϳϵϬ ŽƌĂŶŐĞ͗Ϭ͘ϳϬϱ ƚƌĂĸĐůŝŐŚƚ͗Ϭ͘ϳϰϲ ƚƌĂĸĐůŝŐŚƚ͗Ϭ͘ϵϬϴ ĐĐůŝŐŚƚ ͗Ϭ͘ϵϬϴ ƚƌĂĸĐůŝŐŚƚ͗Ϭ͘ϳϬϵ ǁŝŶĞŐůĂƐƐ͗Ϭ͘ϵϰϱ ĂƉƉůĞ͗Ϭ͘ϵϮϴ ǁŝŶĞŐůĂƐƐ͗Ϭ͘ϵϱϰ ƉŝnjnjĂ͗Ϭ͘ϵϱϲ ďŽǁů͗Ϭ͘ϳϱϳ ƚƌĂĸĐůŝŐŚƚ͗Ϭ͘ϳϱϯ ĐĂƌ͗Ϭ͘ϳϴϰ Ăƌ ͗ Ϭ ϳϴϰ ƚƌƵĐŬ͗Ϭ͘ϴϭϲ ĐĂŬĞ͗Ϭ͘ϳϰϲ ĂƉƉůĞ͗Ϭ͘ϵϮϬ ďĂŶĂŶĂ͗Ϭ͘ϵϲϮ ĐĂƌ͗Ϭ͘ϵϬϭ ĐĂƌ͗Ϭ͘ϵϴϳ ĐĂƌ͗Ϭ͘ϵϰϯ ĐĂƌ ͗ Ϭ ϵϱ ĐĂƌ͗Ϭ͘ϵϱ ĐĂƌ͗Ϭ͘ϵϱϱ ĐĂƌ͗Ϭ͘ϵϱϮ spoon : 0.682 ŽƌĂŶŐĞ͗Ϭ͘ϳϳϰ ďŽǁů͗Ϭ͘ϵϳϰ ŐŝƌĂīĞ͗Ϭ͘ϵϵϱ ƚĞĚĚLJďĞĂƌ͗Ϭ͘ϵϱϰ ƐƵŝƚĐĂƐĞ͗Ϭ͘ ƐƵŝƚĐĂƐĞ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϵϲϵ ͗ Ϭ ϵϲϵ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϲϱϭ ǀĂƐĞ͗Ϭ͘ϳϯϱ ƌĞĨƌŝŐĞƌĂƚŽƌ͗Ϭ͘ϴϮϴ ŵŝĐƌŽǁĂǀĞ͗Ϭ͘ϵϴϱ ĐůŽĐŬ͗Ϭ͘ϳϱϱ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϳϴϯ ŐŝƌĂīĞ͗Ϭ͘ϵϴϬ ĚŽŐ͗Ϭ͘ϵϮϳ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϵϲϭ ƚĞĚĚLJďĞĂƌ͗Ϭ͘ϵϭϬ ĐŚĂŝƌ͗Ϭ͘ϵϲϱ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϵϰϯ ŐŝƌĂīĞ͗Ϭ͘ϵϯϰ ĚŝŶŝŶŐƚĂďůĞ͗Ϭ͘ϴϮϮ ĐŚĂŝƌ͗Ϭ͘ϴ ĐŚĂŝƌ͗Ϭ͘ϴϲϰ ďŽǁů͗Ϭ͘ϵϲϰ ǁŝŶĞŐůĂƐƐ͗Ϭ͘ϵϭϬ ŽǀĞŶ͗Ϭ͘ϲϰϵ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϲϵϵ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϳϴϰ ďŽƩůĞ͗Ϭ͘ϵϬϳ ƉŽƩĞĚƉůĂŶƚ͗Ϭ͘ϳϮϳ Ϭ͘ϵϭϳ ďŽǁů͗Ϭ͘ϳϵϵ ďŽƩůĞ͗Ϭ͘ϵϭϳ person rƉĞƌƐŽŶ͗Ϭ͘ϳϲϭ rson : 0.960 ƉĞƌƐŽŶ͗Ϭ͘ϵϴϭ ƚĞĚĚLJďĞĂƌ͗Ϭ͘ϵϰϴ cup : 0.670 ďŽƩůĞ͗Ϭ͘ϴϱϰ ƩůĞ͗Ϭ͘ϴϱϰ Ʃ ͘ϴ ƚĞĚĚLJďĞĂƌ͗Ϭ͘ϲϵϬ ďŽƩůĞ͗Ϭ͘ϳϬϴ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϵϴϳ Ğ͗Ϭ͘ϵϴϳ Ϭ ϵϴϳ ƐƵŝƚĐĂƐĞ͗Ϭ͘ϲϬϰ cup : 0.907 ĐƵƉ͗Ϭ͘ϴϯϵ ďŽǁů͗Ϭ͘ϵϬϳ ĐŚĂŝƌ͗Ϭ͘ϵϳϮ spoon : 0.948 ƚĞĚĚLJďĞĂƌ͗Ϭ͘ϴϵϰ ďŽǁů͗Ϭ͘ϵϰϴ ŬŶŝĨĞ͗Ϭ͘ϲϳϯ ďŽǁů͗Ϭ͘ϳϴϰ Figure 6: Curated examples of R-FCN results on the MS COCO test-dev set (31.5% AP). The network is ResNet-101, and the training data is MS COCO trainval. A score threshold of 0.6 is used for displaying. [6] R. Girshick. Fast R-CNN. In ICCV, 2015. [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. [8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV. 2014. [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. [10] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. [11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropa- gation applied to handwritten zip code recognition. Neural computation, 1989. [12] K. Lenc and A. Vedaldi. R-CNN minus R. In BMVC, 2015. [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 10

11. Table 7: Detailed detection results on the PASCAL VOC 2007 test set. method data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Faster R-CNN 07+12 76.4 79.8 80.7 76.2 68.3 55.9 85.1 85.3 89.8 56.7 87.8 69.4 88.3 88.9 80.9 78.4 41.7 78.6 79.8 85.3 72.0 Faster R-CNN+++ 07+12+CO 85.6 90.0 89.6 87.8 80.8 76.1 89.9 89.9 89.6 75.5 90.0 80.7 89.6 90.3 89.1 88.7 65.4 88.1 85.6 89.0 86.8 R-FCN 07+12 79.5 82.5 83.7 80.3 69.0 69.2 87.5 88.4 88.4 65.4 87.3 72.1 87.9 88.3 81.3 79.8 54.1 79.6 78.8 87.1 79.5 R-FCN ms train 07+12 80.5 79.9 87.2 81.5 72.0 69.8 86.8 88.5 89.8 67.0 88.1 74.5 89.8 90.6 79.9 81.2 53.7 81.8 81.5 85.9 79.9 R-FCN ms train 07+12+CO 83.6 88.1 88.4 81.5 76.2 73.8 88.7 89.7 89.6 71.1 89.9 76.6 90.0 90.4 88.7 86.6 59.7 87.4 84.1 88.7 82.4 Table 8: Detailed detection results on the PASCAL VOC 2012 test set. † : http://host.robots.ox.ac.uk: ‡ 8080/anonymous/44L5HI.html : http://host.robots.ox.ac.uk:8080/anonymous/MVCM2L.html method data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Faster R-CNN 07++12 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6 Faster R-CNN+++ 07++12+CO 83.8 92.1 88.4 84.8 75.9 71.4 86.3 87.8 94.2 66.8 89.4 69.2 93.9 91.9 90.9 89.6 67.9 88.2 76.8 90.3 80.0 R-FCN ms train† 07++12 77.6 86.9 83.4 81.5 63.8 62.4 81.6 81.1 93.1 58.0 83.8 60.8 92.7 86.0 84.6 84.4 59.0 80.8 68.6 86.1 72.9 R-FCN ms train‡ 07++12+CO 82.0 89.5 88.3 83.5 70.8 70.7 85.5 86.3 94.2 64.7 87.6 65.8 92.7 90.5 89.4 87.8 65.6 85.6 74.5 88.9 77.4 [14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. arXiv:1512.02325v2, 2015. [15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. [16] S. Mallat. A wavelet tour of signal processing. Academic press, 1999. [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. [18] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. [19] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015. [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. [21] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. [22] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016. [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. [25] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013. [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. [27] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013. [28] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 11