Fully Convolutional Networks for Semantic Segmentation

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

Jonathan Long∗, Evan Shelhamer∗, Trevor Darrell
UC Berkeley
{jonlong,shelhamer,trevor}@cs.berkeley.edu
∗ Authors contributed equally
arXiv:1411.4038v2 [cs.CV] 8 Mar 2015

1. Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation. [Diagram labels: forward/inference, backward/learning, pixelwise prediction, segmentation g.t.]

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

2. Related work

Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks. To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets. Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include
• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 8, 28, 11];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
• input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];
• multi-scale pyramid processing [8, 28, 11];
• saturating tanh nonlinearities [8, 5, 28]; and
• ensembles [2, 11],
whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.

They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

3. Fully convolutional networks

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location (i, j) in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$

where k is called the kernel size, s is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.

A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(x; \theta) = \sum_{ij} \ell'(x_{ij}; \theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.

When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.

We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick that OverFeat [29] introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.

3.1. Adapting classifiers for dense prediction

Typical recognition nets, including LeNet [21], AlexNet [19], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce nonspatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)

Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning. [Diagram labels: “tabby cat”, convolutionalization, tabby cat heatmap]

Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to produce the classification scores of a 227 × 227 image, the fully convolutional version takes 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, which is more than 5 times faster than the naïve approach¹.

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.

The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.

¹ Assuming efficient batching of single image inputs. The classification scores for a single image by itself take 5.4 ms to produce, which is nearly 25 times slower than the fully convolutional version.
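
As a minimal sketch of the convolutionalization idea in Section 3.1, the following pure-Python toy treats a fully connected unit as a kernel that covers its entire input. It is not the Caffe implementation of the paper: the helper names `fc` and `conv_valid`, the single-channel setting, and the 2 × 2 weights are all hypothetical choices made for illustration. On an input of exactly the kernel size the convolution reproduces the fully connected output; on a larger input the same weights yield a coarse map with one score per input patch.

```python
def fc(patch, weights, bias):
    """Fully connected unit: weighted sum over a fixed-size k x k patch."""
    return bias + sum(
        p * w
        for patch_row, weight_row in zip(patch, weights)
        for p, w in zip(patch_row, weight_row)
    )


def conv_valid(image, weights, bias, stride=1):
    """Slide the same weights over an input of any size (no padding)."""
    k = len(weights)
    height, width = len(image), len(image[0])
    out = []
    for i in range(0, height - k + 1, stride):
        row = []
        for j in range(0, width - k + 1, stride):
            # Each output cell sees one k x k patch of the input.
            patch = [r[j:j + k] for r in image[i:i + k]]
            row.append(fc(patch, weights, bias))
        out.append(row)
    return out


# Hypothetical toy values (single channel, integer weights).
weights = [[1, 2], [3, 4]]
bias = 1
small = [[1, 2], [5, 6]]                                  # kernel-sized input
large = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]     # larger input

single = conv_valid(small, weights, bias)    # 1 x 1 map: the classifier output
heatmap = conv_valid(large, weights, bias)   # 2 x 3 map: one score per patch
```

The top-left cell of `heatmap` equals `single[0][0]` because both are computed from the same patch, which is the sense in which the output map is equivalent to evaluating the original net on particular input patches.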

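The transformation rule for kernel size and stride given in Section 3 can be folded over a stack of layers to obtain the receptive field size and output stride of a whole net. The sketch below is illustrative only. The function names are hypothetical, and the VGG-16 layout (blocks of 2, 2, 3, 3, 3 convolutions of size 3 × 3, each block followed by 2 × 2 pooling with stride 2, then fc6 viewed as a 7 × 7 convolution) is an assumption taken from the standard description of that architecture [31], not from this text. Under that assumption the result agrees with the rf size and max stride listed for FCN-VGG16 in Table 1.

```python
from functools import reduce


def compose(inner, outer):
    """Compose two layers given as (kernel, stride) pairs.

    `inner` (k', s') is applied first and `outer` (k, s) second; the
    composite has kernel k' + (k - 1) * s' and stride s * s'.
    """
    k_in, s_in = inner
    k_out, s_out = outer
    return (k_in + (k_out - 1) * s_in, s_in * s_out)


def receptive_field(layers):
    """Fold the rule over (kernel, stride) pairs listed input side first."""
    return reduce(compose, layers, (1, 1))  # (1, 1) is the identity layer


# Assumed VGG-16 layout: 3x3 convs in blocks, each followed by 2x2/2 pooling.
vgg16 = []
for n_convs in (2, 2, 3, 3, 3):
    vgg16 += [(3, 1)] * n_convs + [(2, 2)]
vgg16.append((7, 1))  # fc6 viewed as a 7x7 convolution

rf_size, stride = receptive_field(vgg16)
print(rf_size, stride)  # prints: 404 32
```

Later 1 × 1 layers (the converted fc7 and the scoring layer) have kernel size 1 and stride 1, so composing them leaves both numbers unchanged.
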
4.3.2. Shift-and-stitch is filter rarefaction Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss. Input shifting and output interlacing is a trick that yields Note that the deconvolution filter in such a layer need not dense predictions from coarse outputs without interpola- be fixed (e.g., to bilinear upsampling), but can be learned. tion, introduced by OverFeat [29]. If the outputs are down- A stack of deconvolution layers and activation functions can sampled by a factor of f , the input is shifted (by left and top even learn a nonlinear upsampling. padding) x pixels to the right and y pixels down, once for In our experiments, we find that in-network upsampling every value of (x, y) ∈ {0, . . . , f − 1} × {0, . . . , f − 1}. is fast and effective for learning dense prediction. Our best These f 2 inputs are each run through the convnet, and the segmentation architecture uses these layers to learn to up- outputs are interlaced so that the predictions correspond to sample for refined prediction in Section 4.2. the pixels at the centers of their receptive fields. Changing only the filters and layer strides of a convnet 3.4. Patchwise training is loss sampling can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride In stochastic optimization, gradient computation is s, and a following convolution layer with filter weights fij driven by the training distribution. Both patchwise train- (eliding the feature dimensions, irrelevant here). Setting the ing and fully-convolutional training can be made to pro- lower layer’s input stride to 1 upsamples its output by a fac- duce any distribution, although their relative computational tor of s, just like shift-and-stitch. However, convolving the efficiency depends on overlap and minibatch size. 
Whole original filter with the upsampled output does not produce image fully convolutional training is identical to patchwise the same result as the trick, because the original filter only training where each batch consists of all the receptive fields sees a reduced portion of its (now upsampled) input. To of the units below the loss for an image (or collection of reproduce the trick, rarefy the filter by enlarging it as images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. How- fi/s,j/s if s divides both i and j; ever, random selection of patches within an image may be fij = recovered simply. Restricting the loss to a randomly sam- 0 otherwise, pled subset of its spatial terms (or, equivalently applying a (with i and j zero-based). Reproducing the full net output DropConnect mask [36] between the output and the loss) of the trick involves repeating this filter enlargement layer- excludes patches from the gradient computation. by-layer until all subsampling is removed. If the kept patches still have significant overlap, fully Simply decreasing subsampling within a net is a tradeoff: convolutional computation will still speed up training. If the filters see finer information, but have smaller receptive gradients are accumulated over multiple backward passes, fields and take longer to compute. We have seen that the batches can include patches from several images.2 shift-and-stitch trick is another kind of tradeoff: the output Sampling in patchwise training can correct class imbal- is made denser without decreasing the receptive field sizes ance [27, 8, 2] and mitigate the spatial correlation of dense of the filters, but the filters are prohibited from accessing patches [28, 16]. In fully convolutional training, class bal- information at a finer scale than their original design. 
ance can also be achieved by weighting the loss, and loss Although we have done preliminary experiments with sampling can be used to address spatial correlation. shift-and-stitch, we do not use it in our model. We find We explore training with sampling in Section 4.3, and do learning through upsampling, as described in the next sec- not find that it yields faster or better convergence for dense tion, to be more effective and efficient, especially when prediction. Whole image training is effective and efficient. combined with the skip layer fusion described later on. 4. Segmentation Architecture 3.3. Upsampling is backwards strided convolution We cast ILSVRC classifiers into FCNs and augment Another way to connect coarse outputs to dense pixels them for dense prediction with in-network upsampling and is interpolation. For instance, simple bilinear interpolation a pixelwise loss. We train for segmentation by fine-tuning. computes each output yij from the nearest four inputs by a Next, we build a novel skip architecture that combines linear map that depends only on the relative positions of the coarse, semantic and local, appearance information to re- input and output cells. fine prediction. In a sense, upsampling with factor f is convolution with For this investigation, we train and validate on the PAS- a fractional input stride of 1/f . So long as f is integral, a CAL VOC 2011 segmentation challenge [7]. We train with natural way to upsample is therefore backwards convolution 2 Note that not every possible patch is included this way, since the re- (sometimes called deconvolution) with an output stride of ceptive fields of the final layer units lie on a fixed, strided grid. However, f . Such an operation is trivial to implement, since it simply by shifting the image left and down by a random value up to the stride, reverses the forward and backward passes of convolution. random selection from all possible patches may be recovered.

5.a per-pixel multinomial logistic loss and validate with the Table 1. We adapt and extend three classification convnets to seg- mentation. We compare performance by mean intersection over standard metric of mean pixel intersection over union, with union on the validation set of PASCAL VOC 2011 and by infer- the mean taken over all classes, including background. The ence time (averaged over 20 trials for a 500 × 500 input on an training ignores pixels that are masked out (as ambiguous NVIDIA Tesla K40c). We detail the architecture of the adapted or difficult) in the ground truth. nets as regards dense prediction: number of parameter layers, re- ceptive field size of output units, and the coarsest stride within the 4.1. From classifier to dense FCN net. (These numbers give the best performance obtained at a fixed We begin by convolutionalizing proven classification ar- learning rate, not best performance possible.) FCN- FCN- FCN- chitectures as in Section 3. We consider the AlexNet3 ar- AlexNet VGG16 GoogLeNet4 chitecture [19] that won ILSVRC12, as well as the VGG mean IU 39.8 56.0 42.5 nets [31] and the GoogLeNet4 [32] which did exception- forward time 50 ms 210 ms 59 ms ally well in ILSVRC14. We pick the VGG 16-layer net5 , conv. layers 8 16 22 which we found to be equivalent to the 19-layer net on this parameters 57M 134M 6M task. For GoogLeNet, we use only the final loss layer, and rf size 355 404 907 improve performance by discarding the final average pool- max stride 32 32 32 ing layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with chan- turns a line topology into a DAG, with edges that skip ahead nel dimension 21 to predict scores for each of the PAS- from lower layers to higher ones (Figure 3). 
As they see CAL classes (including background) at each of the coarse fewer pixels, the finer scale predictions should need fewer output locations, followed by a deconvolution layer to bi- layers, so it makes sense to make them from shallower net linearly upsample the coarse outputs to pixel-dense outputs outputs. Combining fine layers and coarse layers lets the as described in Section 3.3. Table 1 compares the prelim- model make local predictions that respect global structure. inary validation results along with the basic characteristics By analogy to the multiscale local jet of Florack et al. [10], of each net. We report the best results achieved after con- we call our nonlinear local feature hierarchy the deep jet. vergence at a fixed learning rate (at least 175 epochs). We first divide the output stride in half by predicting Fine-tuning from classification to segmentation gave rea- from a 16 pixel stride layer. We add a 1 × 1 convolution sonable predictions for each net. Even the worst model layer on top of pool4 to produce additional class predic- achieved ∼ 75% of state-of-the-art performance. The tions. We fuse this output with the predictions computed segmentation-equippped VGG net (FCN-VGG16) already on top of conv7 (convolutionalized fc7) at stride 32 by appears to be state-of-the-art at 56.0 mean IU on val, com- adding a 2× upsampling layer and summing6 both predic- pared to 52.6 on test [16]. Training on extra data raises tions. (See Figure 3). We initialize the 2× upsampling to performance to 59.4 mean IU on a subset of val7 . Training bilinear interpolation, but allow the parameters to be learned details are given in Section 4.3. as described in Section 3.3. Finally, the stride 16 predictions Despite similar classification accuracy, our implementa- are upsampled back to the image. We call this net FCN-16s. tion of GoogLeNet did not match this segmentation result. FCN-16s is learned end-to-end, initialized with the param- 4.2. 
Combining what and where eters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so We define a new fully convolutional net (FCN) for seg- that the net starts with unmodified predictions. The learning mentation that combines layers of the feature hierarchy and rate is decreased by a factor of 100. refines the spatial precision of the output. See Figure 3. Learning this skip net improves performance on the val- While fully convolutionalized classifiers can be fine- idation set by 3.0 mean IU to 62.4. Figure 4 shows im- tuned to segmentation as shown in 4.1, and even score provement in the fine structure of the output. We compared highly on the standard metric, their output is dissatisfyingly this fusion with learning only from the pool4 layer (which coarse (see Figure 4). The 32 pixel stride at the final predic- resulted in poor performance), and simply decreasing the tion layer limits the scale of detail in the upsampled output. learning rate without adding the extra link (which results We address this by adding links that combine the final in an insignificant performance improvement, without im- prediction layer with lower layers with finer strides. This proving the quality of the output). 3 Using We continue in this fashion by fusing predictions from the publicly available CaffeNet reference model. 4 Since there is no publicly available version of GoogLeNet, we use pool3 with a 2× upsampling of predictions fused from our own reimplementation. Our version is trained with less extensive data pool4 and conv7, building the net FCN-8s. We obtain augmentation, and gets 68.5% top-1 and 88.4% top-5 ILSVRC accuracy. 5 Using the publicly available version from the Caffe model zoo. 6 Max fusion made learning difficult due to gradient switching.

6. FCN-32s FCN-16s FCN-8s Ground truth 14 × 14 in order to maintain its receptive field size. In addi- tion to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not suc- cessful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important. Another way to obtain finer predictions is to use the shift- Figure 4. Refining fully convolutional nets by fusing information and-stitch trick described in Section 3.2. In limited exper- from layers with different strides improves segmentation detail. iments, we found the cost to improvement ratio from this The first three images show the output from our 32, 16, and 8 method to be worse than layer fusion. pixel stride nets (see Figure 3). 4.3. Experimental framework Table 2. Comparison of skip FCNs on a subset of PASCAL VOC2011 validation7 . Learning is end-to-end, except for FCN- Optimization We train by SGD with momentum. We 32s-fixed, where only the last layer is fine-tuned. Note that FCN- use a minibatch size of 20 images and fixed learning rates of 32s is FCN-VGG16, renamed to highlight stride. 10−3 , 10−4 , and 5−5 for FCN-AlexNet, FCN-VGG16, and pixel mean mean f.w. FCN-GoogLeNet, respectively, chosen by line search. We acc. acc. IU IU use momentum 0.9, weight decay of 5−4 or 2−4 , and dou- FCN-32s-fixed 83.0 59.7 45.4 72.0 bled the learning rate for biases, although we found training FCN-32s 89.1 73.3 59.4 81.4 to be insensitive to these parameters (but sensitive to the FCN-16s 90.0 75.7 62.4 83.0 learning rate). We zero-initialize the class scoring convo- FCN-8s 90.3 75.9 62.7 83.2 lution layer, finding random initialization to yield neither better performance nor faster convergence. Dropout was in- cluded where used in the original classifier nets. 
a minor additional improvement to 62.7 mean IU, and find Fine-tuning We fine-tune all layers by back- a slight improvement in the smoothness and detail of our propagation through the whole net. Fine-tuning the output. At this point our fusion improvements have met di- output classifier alone yields only 70% of the full fine- minishing returns, both with respect to the IU metric which tuning performance as compared in Table 2. Training from emphasizes large-scale correctness, and also in terms of the scratch is not feasible considering the time required to improvement visible e.g. in Figure 4, so we do not continue learn the base classification nets. (Note that the VGG net is fusing even lower layers. trained in stages, while we initialize from the full 16-layer Refinement by other means Decreasing the stride of version.) Fine-tuning takes three days on a single GPU for pooling layers is the most straightforward way to obtain the coarse FCN-32s version, and about one day each to finer predictions. However, doing so is problematic for our upgrade to the FCN-16s and FCN-8s versions. VGG16-based net. Setting the pool5 layer to have stride 1 Patch Sampling As explained in Section 3.4, our full requires our convolutionalized fc6 to have a kernel size of image training effectively batches each image into a regu- 32x upsampled 2x upsampled 16x upsampled 2x upsampled 8x upsampled prediction (FCN-32s) prediction prediction (FCN-16s) prediction prediction (FCN-8s) image pool1 pool2 pool3 pool4 pool5 pool4 P pool3 P prediction prediction Figure 3. Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Layers are shown as grids that reveal relative spatial coarseness. Only pooling and prediction layers are shown; intermediate convolution layers (including our converted fully connected layers) are omitted. Solid line (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. 
Dashed line (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Dotted line (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision.

lar grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 8, 28, 11], potentially resulting in higher variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability $1 - p$. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor $1/p$. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of $p$ (e.g., at least for $p > 0.2$ according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.

[Figure 5 plots: loss against iteration number (left) and against relative time in number of images processed (right), for full images, 50% sampling, and 25% sampling.]
Figure 5. Training on whole images is just as effective as sampling patches, but results in faster (wall time) convergence by making more efficient use of data. Left shows the effect of sampling on convergence rate for a fixed expected batch size, while right plots the same by relative wall time.

Class Balancing. Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.

Dense Prediction. The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned. Shift-and-stitch (Section 3.2), or the filter rarefaction equivalent, are not used.

Augmentation. We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.

More Training Data. The PASCAL VOC 2011 segmentation challenge training set, which we used for Table 1, labels 1112 images. Hariharan et al. [15] have collected labels for a much larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS [16]. This training data improves the FCN-VGG16 validation score⁷ by 3.4 points to 59.4 mean IU.

Implementation. All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. The models and code will be released open-source on publication.

5. Results

We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture⁸ on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.

Metrics. We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{\mathrm{cl}}$ different classes, and let $t_i = \sum_j n_{ij}$ be the total number of pixels of class $i$. We compute:
• pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
• mean accuracy: $(1/n_{\mathrm{cl}}) \sum_i n_{ii} / t_i$
• mean IU: $(1/n_{\mathrm{cl}}) \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
• frequency weighted IU: $\left(\sum_k t_k\right)^{-1} \sum_i t_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$

PASCAL VOC. Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [16], and the well-known R-CNN [12]. We achieve the best results on mean IU⁹ by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).

Table 3. Our fully convolutional net gives a 20% relative improvement over the state-of-the-art on the PASCAL VOC 2011 and 2012 test sets, and reduces inference time.

              mean IU         mean IU         inference
              VOC2011 test    VOC2012 test    time
R-CNN [12]    47.9            -               -
SDS [16]      52.6            51.6            ∼ 50 s
FCN-8s        62.7            62.2            ∼ 175 ms

⁷ There are training images from [15] included in the PASCAL VOC 2011 val set, so we validate on the non-intersecting set of 736 images. An earlier version of this paper mistakenly evaluated on the entire val set.
⁸ Our models and code are publicly available at https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn.
⁹ This is the only metric provided by the test server.
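The four metrics defined in the Metrics paragraph can be computed directly from the confusion matrix $n_{ij}$. The following is a minimal sketch, not the authors' evaluation code: the function names `confusion_matrix` and `segmentation_metrics` are hypothetical, and it assumes that classes with no ground-truth pixels ($t_i = 0$) are left out of the per-class averages, since $n_{ii}/t_i$ is undefined for them.

```python
def confusion_matrix(gt, pred, n_cl):
    """Return n, where n[i][j] counts pixels of class i predicted as class j."""
    n = [[0] * n_cl for _ in range(n_cl)]
    for gt_row, pred_row in zip(gt, pred):
        for i, j in zip(gt_row, pred_row):
            n[i][j] += 1
    return n


def segmentation_metrics(n):
    """Compute pixel accuracy, mean accuracy, mean IU and f.w. IU from n."""
    n_cl = len(n)
    t = [sum(row) for row in n]  # t_i = sum_j n_ij
    predicted = [sum(n[j][i] for j in range(n_cl)) for i in range(n_cl)]  # sum_j n_ji
    # Assumption: classes absent from the ground truth (t_i = 0) are skipped.
    present = [i for i in range(n_cl) if t[i] > 0]
    # Per-class IU: n_ii / (t_i + sum_j n_ji - n_ii)
    iu = {i: n[i][i] / (t[i] + predicted[i] - n[i][i]) for i in present}
    total = sum(t)
    return {
        "pixel_acc": sum(n[i][i] for i in range(n_cl)) / total,
        "mean_acc": sum(n[i][i] / t[i] for i in present) / len(present),
        "mean_iu": sum(iu.values()) / len(present),
        "fw_iu": sum(t[i] * iu[i] for i in present) / total,
    }


# Toy 2 x 3 label maps with three classes (illustrative data only).
gt = [[0, 0, 1], [1, 1, 2]]
pred = [[0, 1, 1], [1, 1, 2]]
metrics = segmentation_metrics(confusion_matrix(gt, pred, 3))
```

When every class appears in the ground truth, averaging over the present classes is identical to the $1/n_{\mathrm{cl}}$ factor in the definitions.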

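The Dense Prediction paragraph relies on deconvolution filters that are fixed or initialized to bilinear interpolation. As a hedged illustration, the sketch below builds such a kernel as the outer product of a one-dimensional triangular profile. The function name `bilinear_kernel` is hypothetical, and the kernel size is left as a parameter because the text does not state the sizes used.

```python
def bilinear_kernel(size):
    """Return a size x size list of bilinear interpolation weights."""
    factor = (size + 1) // 2
    # Centre of the kernel in (possibly fractional) pixel coordinates.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    # 1-D triangular (tent) profile; the 2-D kernel is its outer product.
    profile = [1 - abs(i - center) / factor for i in range(size)]
    return [[a * b for b in profile] for a in profile]


# Example: a 4 x 4 kernel, a common choice for 2x upsampling (assumption).
kernel = bilinear_kernel(4)
```

For the 4 × 4 case the profile is 0.25, 0.75, 0.75, 0.25, so the weights sum to 4. An initialization of this kind gives an upsampling layer a sensible starting point that training can then adjust.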
NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [13]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [14], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.

Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth channels at the input. HHA is the depth embedding of [14] as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late fusion model that sums RGB and HHA predictions.

                     pixel acc.   mean acc.   mean IU   f.w. IU
Gupta et al. [14]    60.3         -           28.6      47.0
FCN-32s RGB          60.0         42.2        29.2      43.9
FCN-32s RGBD         61.5         42.4        30.5      45.5
FCN-32s HHA          57.1         35.2        24.2      40.4
FCN-32s RGB-HHA      64.3         44.9        32.8      48.0
FCN-16s RGB-HHA      65.4         46.1        34.0      49.5

SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images,¹⁰ show state-of-the-art performance on both tasks.

Table 5. Results on SIFT Flow¹⁰ with class segmentation (center) and geometric segmentation (right). Tighe [33] is a non-parametric transfer method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted $\mathrm{RCNN}_3$ ($\circ^3$). The metric for geometry is pixel accuracy.

                        pixel acc.   mean acc.   mean IU   f.w. IU   geom. acc.
Liu et al. [23]         76.7         -           -         -         -
Tighe et al. [33]       -            -           -         -         90.8
Tighe et al. [34] 1     75.6         41.1        -         -         -
Tighe et al. [34] 2     78.6         39.2        -         -         -
Farabet et al. [8] 1    72.3         50.8        -         -         -
Farabet et al. [8] 2    78.5         29.6        -         -         -
Pinheiro et al. [28]    77.7         29.8        -         -         -
FCN-16s                 85.2         51.7        39.5      76.1      94.3

¹⁰ Three of the SIFT Flow categories are not present in the test set. We made predictions across all 33 categories, but only included categories actually present in the test set in our evaluation. (An earlier version of this paper reported a lower mean IU, which included all categories either present or predicted in the evaluation.)

[Figure 6 image grid, columns from left to right: FCN-8s, SDS [16], Ground Truth, Image.]
Figure 6. Fully convolutional segmentation nets produce state-of-the-art performance on PASCAL. The left column shows the output of our highest performing net, FCN-8s. The second shows the segmentations produced by the previous state-of-the-art system by Hariharan et al. [16]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fourth row shows a failure case: the net sees lifejackets in a boat as people.

6. Conclusion

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.

Acknowledgements. This work was supported in part

by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.

A. Upper Bounds on IU

In this paper, we have achieved good performance on the mean IU segmentation metric even with coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various scales. We do this by downsampling ground truth images and then upsampling them again to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.

factor    mean IU
128       50.9
64        73.3
32        86.1
16        92.8
8         96.4
4         98.5

Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is not a good measure of fine-scale accuracy.

B. More Results

We further evaluate our FCN for semantic segmentation.

PASCAL-Context [26] provides whole scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59 class task defined by [26] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 6, we compare to the joint object + stuff variation of Convolutional Feature Masking [3] which is the previous state-of-the-art on this task. FCN-8s scores 35.1 mean IU for an 11% relative improvement.

Table 6. Results on PASCAL-Context. CFM is the best result of [3] by convolutional feature masking and segment pursuit with the VGG net. $O_2P$ is the second order pooling method [1] as reported in the errata of [26]. The 59 class task includes the 59 most frequent classes while the 33 class task consists of an easier subset identified by [26].

            pixel acc.   mean acc.   mean IU   f.w. IU
59 class
O2P         -            -           18.1      -
CFM         -            -           31.5      -
FCN-32s     63.8         42.7        31.8      48.3
FCN-16s     65.7         46.2        34.8      50.7
FCN-8s      65.9         46.5        35.1      51.0
33 class
O2P         -            -           29.2      -
CFM         -            -           46.1      -
FCN-32s     69.8         65.1        50.4      54.9
FCN-16s     71.8         68.0        53.4      57.5
FCN-8s      71.8         67.6        53.5      57.7

Changelog

The arXiv version of this paper is kept up-to-date with corrections and additional relevant material. The following gives a brief history of changes.

v2 Add Appendix A giving upper bounds on mean IU and Appendix B with PASCAL-Context results. Correct PASCAL validation numbers (previously, some val images were included in train), SIFT Flow mean IU (which used an inappropriately strict metric), and an error in the frequency weighted mean IU formula. Add link to models and update timing numbers to reflect improved implementation (which is publicly available).

References

[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[2] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860, 2012.
[3] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv preprint arXiv:1412.1283, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 633–640. IEEE, 2013.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013.
[9] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. CoRR, abs/1405.5769, 2014.
[10] L. Florack, B. T. H. Romeny, M. Viergever, and J. Koenderink. The gaussian scale-space paradigm and the multi-scale local jet. International Journal of Computer Vision, 18(1):61–75, 1996.
[11] Y. Ganin and V. Lempitsky. N4-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[13] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[14] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV. Springer, 2014.
[15] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Q. V. Le, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[21] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to hand-written zip code recognition. In Neural Computation, 1989.
[22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 1998.
[23] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978–994, 2011.
[24] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
[25] O. Matan, C. J. Burges, Y. LeCun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. In NIPS, pages 488–495. Citeseer, 1991.
[26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 891–898. IEEE, 2014.
[27] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9):1360–1371, 2005.
[28] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[30] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[33] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365. Springer, 2010.
[34] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.
[35] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. CoRR, abs/1406.2984, 2014.
[36] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[37] R. Wolf and J. C. Platt. Postal address block location using a convolutional locator network. Advances in Neural Information Processing Systems, pages 745–745, 1994.
[38] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[39] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In Computer Vision–ECCV 2014, pages 834–849. Springer, 2014.
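As a rough illustration of the upper-bound experiment in Appendix A, the sketch below downsamples a label map by a given factor, upsamples it back, and scores the result against the original with mean IU. The text does not say which resampling method was used, so nearest-neighbour subsampling (keeping the top-left pixel of each block) is an assumption here, as are the helper names `down_up` and `mean_iu` and the toy 8 × 8 label map.

```python
def down_up(labels, factor):
    """Keep every `factor`-th pixel, then repeat it back to full size.

    Assumption: nearest-neighbour resampling; the paper does not specify it.
    """
    h, w = len(labels), len(labels[0])
    return [[labels[(y // factor) * factor][(x // factor) * factor]
             for x in range(w)] for y in range(h)]


def mean_iu(gt, pred, n_cl):
    """Mean intersection over union, averaged over classes present in gt."""
    inter = [0] * n_cl
    gt_count = [0] * n_cl
    pred_count = [0] * n_cl
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            gt_count[g] += 1
            pred_count[p] += 1
            if g == p:
                inter[g] += 1
    ius = [inter[c] / (gt_count[c] + pred_count[c] - inter[c])
           for c in range(n_cl) if gt_count[c] > 0]
    return sum(ius) / len(ius)


# Toy ground truth: background (0) with a 3 x 3 object (1) that is not
# aligned to the coarse sampling grid.
gt = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(8)]
      for y in range(8)]
bounds = {f: mean_iu(gt, down_up(gt, f), 2) for f in (1, 2, 4)}
```

On this toy map the bound falls as the factor grows, which mirrors the trend in the Appendix A table. The values themselves are not comparable to the PASCAL figures.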