Fully Convolutional Networks for Semantic Segmentation
Jonathan Long∗, Evan Shelhamer∗, Trevor Darrell
UC Berkeley
{jonlong,shelhamer,trevor}@cs.berkeley.edu
∗ Authors contributed equally

arXiv:1411.4038v2 [cs.CV] 8 Mar 2015

Abstract

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation. (Figure labels: forward/inference, backward/learning, pixelwise prediction, segmentation g.t.)

1. Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature
hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

2. Related work

Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks. To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets. Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include
• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 8, 28, 11];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
• input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];
• multi-scale pyramid processing [8, 28, 11];
• saturating tanh nonlinearities [8, 5, 28]; and
• ensembles [2, 11],
whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.

They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

3. Fully convolutional networks

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $\mathbf{x}_{ij}$ for the data vector at location (i, j) in a particular layer, and $\mathbf{y}_{ij}$ for the following layer, these functions compute outputs $\mathbf{y}_{ij}$ by

$$\mathbf{y}_{ij} = f_{ks}\left(\{\mathbf{x}_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$

where k is called the kernel size, s is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.

A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(\mathbf{x}; \theta) = \sum_{ij} \ell'(\mathbf{x}_{ij}; \theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.

When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.

We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick that OverFeat [29] introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.

3.1. Adapting classifiers for dense prediction

Typical recognition nets, including LeNet [21], AlexNet [19], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce nonspatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)

Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning. (Figure labels: “tabby cat”, convolutionalization, tabby cat heatmap.)

Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to produce the classification scores of a 227 × 227 image, the fully convolutional version takes 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, which is more than 5 times faster than the naïve approach¹.

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.

The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.

¹ Assuming efficient batching of single image inputs. The classification scores for a single image by itself take 5.4 ms to produce, which is nearly 25 times slower than the fully convolutional version.
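The reinterpretation in Section 3.1 can be illustrated with a small sketch. The plain-Python code below is an illustration only, not the Caffe implementation used in this work; the function names `fc_forward`, `fc_to_conv_kernel`, and `conv_forward` are hypothetical, and a single input channel is assumed for brevity. It reshapes the weights of a fully connected layer into a kernel that covers the layer's whole input, then slides that kernel over a larger image, so that each output cell equals the fully connected layer applied to the corresponding patch.

```python
def fc_forward(flat_input, flat_weights, bias):
    """Fully connected layer: one score per output unit from a flat input."""
    return [
        b + sum(w_i * x_i for w_i, x_i in zip(w, flat_input))
        for w, b in zip(flat_weights, bias)
    ]


def fc_to_conv_kernel(flat_w, k):
    """Reshape one unit's flat weight vector (length k*k) into a k x k kernel."""
    return [flat_w[r * k:(r + 1) * k] for r in range(k)]


def conv_forward(image, kernels, bias, stride=1):
    """Slide the k x k kernels over a (possibly larger) single-channel image.

    Returns out[y][x], the list of per-unit scores for the patch whose
    top-left corner is at (y * stride, x * stride).
    """
    k = len(kernels[0])
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            [
                b + sum(
                    kern[di][dj] * image[y * stride + di][x * stride + dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for kern, b in zip(kernels, bias)
            ]
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

With a 2 × 2 "classifier" and a 3 × 3 image, `conv_forward` returns a 2 × 2 map of score vectors instead of a single score vector, and each cell matches `fc_forward` on the flattened 2 × 2 patch at that location.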
3.2. Shift-and-stitch is filter rarefaction

Input shifting and output interlacing is a trick that yields dense predictions from coarse outputs without interpolation, introduced by OverFeat [29]. If the outputs are downsampled by a factor of f, the input is shifted (by left and top padding) x pixels to the right and y pixels down, once for every value of (x, y) ∈ {0, ..., f − 1} × {0, ..., f − 1}. These f² inputs are each run through the convnet, and the outputs are interlaced so that the predictions correspond to the pixels at the centers of their receptive fields.

Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride s, and a following convolution layer with filter weights $f_{ij}$ (eliding the feature dimensions, irrelevant here). Setting the lower layer's input stride to 1 upsamples its output by a factor of s, just like shift-and-stitch. However, convolving the original filter with the upsampled output does not produce the same result as the trick, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

$$f'_{ij} = \begin{cases} f_{i/s,\, j/s} & \text{if } s \text{ divides both } i \text{ and } j; \\ 0 & \text{otherwise,} \end{cases}$$

(with i and j zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed.

Simply decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. We have seen that the shift-and-stitch trick is another kind of tradeoff: the output is made denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.

Although we have done preliminary experiments with shift-and-stitch, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.

3.3. Upsampling is backwards strided convolution

Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output $y_{ij}$ from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.

In a sense, upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of f. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.

Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.

In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.

3.4. Patchwise training is loss sampling

In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully-convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.

If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.²

Sampling in patchwise training can correct class imbalance [27, 8, 2] and mitigate the spatial correlation of dense patches [28, 16]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.

We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.

4. Segmentation Architecture

We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction.

For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [7].

² Note that not every possible patch is included this way, since the receptive fields of the final layer units lie on a fixed, strided grid. However, by shifting the image left and down by a random value up to the stride, random selection from all possible patches may be recovered.
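As a rough illustration of the backwards (transposed) convolution of Section 3.3, the sketch below builds a bilinear interpolation kernel for an integer factor f and applies it with output stride f. It is plain Python for a single channel; the names `bilinear_kernel` and `transposed_conv2d` are hypothetical, and the kernel-size convention (2f − f mod 2) is one common choice assumed for this sketch rather than a detail stated in the text. In a network, such a kernel would only be the initial value of a deconvolution filter that may then be learned.

```python
def bilinear_kernel(f):
    """2-D bilinear interpolation kernel for integer upsampling factor f.

    The size convention (2f - f % 2) is an assumption of this sketch.
    """
    size = 2 * f - f % 2
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    taps = [1 - abs(i - center) / factor for i in range(size)]
    # The 2-D kernel is the outer product of the 1-D taps.
    return [[a * b for b in taps] for a in taps]


def transposed_conv2d(x, kernel, stride):
    """Backwards convolution: every input cell scatters a scaled copy of the
    kernel into the output, with copies spaced `stride` pixels apart."""
    k = len(kernel)
    h, w = len(x), len(x[0])
    out_h = (h - 1) * stride + k
    out_w = (w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    out[i * stride + di][j * stride + dj] += x[i][j] * kernel[di][dj]
    return out
```

For f = 2 the one-dimensional taps are 0.25, 0.75, 0.75, 0.25, so a constant input map is reproduced exactly in the interior of the upsampled output, while the border cells are attenuated because fewer kernel copies overlap there.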
We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.

4.1. From classifier to dense FCN

We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet³ architecture [19] that won ILSVRC12, as well as the VGG nets [31] and the GoogLeNet⁴ [32] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net⁵, which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).

Table 1. We adapt and extend three classification convnets to segmentation. We compare performance by mean intersection over union on the validation set of PASCAL VOC 2011 and by inference time (averaged over 20 trials for a 500 × 500 input on an NVIDIA Tesla K40c). We detail the architecture of the adapted nets as regards dense prediction: number of parameter layers, receptive field size of output units, and the coarsest stride within the net. (These numbers give the best performance obtained at a fixed learning rate, not best performance possible.)

                FCN-AlexNet   FCN-VGG16   FCN-GoogLeNet⁴
mean IU         39.8          56.0        42.5
forward time    50 ms         210 ms      59 ms
conv. layers    8             16          22
parameters      57M           134M        6M
rf size         355           404         907
max stride      32            32          32

Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼75% of state-of-the-art performance. The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [16]. Training on extra data raises performance to 59.4 mean IU on a subset of val⁷. Training details are given in Section 4.3.

Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.

4.2. Combining what and where

We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.

While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in Section 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.

We address this by adding links that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the multiscale local jet of Florack et al. [10], we call our nonlinear local feature hierarchy the deep jet.

We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing⁶ both predictions (see Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.

Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer (which resulted in poor performance), and simply decreasing the learning rate without adding the extra link (which resulted in an insignificant performance improvement, without improving the quality of the output).

We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s.

³ Using the publicly available CaffeNet reference model.
⁴ Since there is no publicly available version of GoogLeNet, we use our own reimplementation. Our version is trained with less extensive data augmentation, and gets 68.5% top-1 and 88.4% top-5 ILSVRC accuracy.
⁵ Using the publicly available version from the Caffe model zoo.
⁶ Max fusion made learning difficult due to gradient switching.
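The fusion step of Section 4.2 might be sketched as follows. This is a simplified, plain-Python illustration and not the trained network: the names `upsample2x` and `fuse_scores` are hypothetical, the 2× upsampling is fixed to bilinear taps rather than learned, and borders are handled by cropping one cell on each side, which is an assumption of this sketch. Each argument of `fuse_scores` is a list of per-class score maps, with the finer maps twice the height and width of the coarser ones.

```python
# 1-D bilinear taps for 2x upsampling (the initial value of the learned filter).
TAPS = [0.25, 0.75, 0.75, 0.25]


def upsample2x(score_map):
    """2x upsampling of one score map by transposed convolution with a
    4 x 4 bilinear kernel and output stride 2, cropped to exactly twice
    the input size."""
    h, w = len(score_map), len(score_map[0])
    full = [[0.0] * (2 * w + 2) for _ in range(2 * h + 2)]
    for i in range(h):
        for j in range(w):
            for di in range(4):
                for dj in range(4):
                    full[2 * i + di][2 * j + dj] += (
                        score_map[i][j] * TAPS[di] * TAPS[dj]
                    )
    # Crop one cell from every side: (2h + 2) x (2w + 2) -> 2h x 2w.
    return [row[1:-1] for row in full[1:-1]]


def fuse_scores(coarse_scores, fine_scores):
    """Sum 2x-upsampled coarse class scores (e.g. from conv7, stride 32)
    with class scores predicted from a finer layer (e.g. pool4, stride 16)."""
    fused = []
    for coarse, fine in zip(coarse_scores, fine_scores):
        up = upsample2x(coarse)
        fused.append(
            [[u + f for u, f in zip(up_row, fine_row)]
             for up_row, fine_row in zip(up, fine)]
        )
    return fused
```

When the finer scores are all zero, as with the zero-initialized pool4 predictor, the fused output is just the upsampled coarse prediction, which mirrors the statement that the net starts with unmodified predictions.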
We obtain a minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers.

Figure 3. Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Layers are shown as grids that reveal relative spatial coarseness. Only pooling and prediction layers are shown; intermediate convolution layers (including our converted fully connected layers) are omitted. Solid line (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Dashed line (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Dotted line (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision. (Figure labels: image, pool1 to pool5; 32x upsampled prediction (FCN-32s); pool4 prediction with 2x upsampled prediction, 16x upsampled prediction (FCN-16s); pool3 prediction with 2x upsampled prediction, 8x upsampled prediction (FCN-8s).)

Figure 4. Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail. The first three images show the output from our 32, 16, and 8 pixel stride nets (see Figure 3). (Panel titles: FCN-32s, FCN-16s, FCN-8s, Ground truth.)

Table 2. Comparison of skip FCNs on a subset of PASCAL VOC2011 validation⁷. Learning is end-to-end, except for FCN-32s-fixed, where only the last layer is fine-tuned. Note that FCN-32s is FCN-VGG16, renamed to highlight stride.

                 pixel acc.   mean acc.   mean IU   f.w. IU
FCN-32s-fixed    83.0         59.7        45.4      72.0
FCN-32s          89.1         73.3        59.4      81.4
FCN-16s          90.0         75.7        62.4      83.0
FCN-8s           90.3         75.9        62.7      83.2

Refinement by other means. Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14 × 14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important.

Another way to obtain finer predictions is to use the shift-and-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion.

4.3. Experimental framework

Optimization. We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of $10^{-3}$, $10^{-4}$, and $5^{-5}$ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of $5^{-4}$ or $2^{-4}$, and doubled the learning rate for biases, although we found training to be insensitive to these parameters (but sensitive to the learning rate). We zero-initialize the class scoring convolution layer, finding random initialization to yield neither better performance nor faster convergence. Dropout was included where used in the original classifier nets.

Fine-tuning. We fine-tune all layers by back-propagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full fine-tuning performance as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions.

Patch Sampling. As explained in Section 3.4, our full image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 8, 28, 11], potentially resulting in higher variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability 1 − p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.

Figure 5. Training on whole images is just as effective as sampling patches, but results in faster (wall time) convergence by making more efficient use of data. Left shows the effect of sampling on convergence rate for a fixed expected batch size, while right plots the same by relative wall time. (Legend: full images, 50% sampling, 25% sampling. Axes: loss against iteration number on the left; loss against relative time, in number of images processed, on the right.)

Class Balancing. Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.

Dense Prediction. The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned. Shift-and-stitch (Section 3.2), or the filter rarefaction equivalent, is not used.

Augmentation. We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.

More Training Data. The PASCAL VOC 2011 segmentation challenge training set, which we used for Table 1, labels 1112 images. Hariharan et al. [15] have collected labels for a much larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS [16]. This training data improves the FCN-VGG16 validation score⁷ by 3.4 points to 59.4 mean IU.

Implementation. All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. The models and code will be released open-source on publication.

5. Results

We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture⁸ on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.

Metrics. We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class i predicted to belong to class j, where there are $n_{cl}$ different classes, and let $t_i = \sum_j n_{ij}$ be the total number of pixels of class i. We compute:
• pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
• mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
• mean IU: $(1/n_{cl}) \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$
• frequency weighted IU: $\left(\sum_k t_k\right)^{-1} \sum_i t_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$

PASCAL VOC. Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [16], and the well-known R-CNN [12]. We achieve the best results on mean IU⁹ by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).

Table 3. Our fully convolutional net gives a 20% relative improvement over the state-of-the-art on the PASCAL VOC 2011 and 2012 test sets, and reduces inference time.

              mean IU         mean IU         inference
              VOC2011 test    VOC2012 test    time
R-CNN [12]    47.9            -               -
SDS [16]      52.6            51.6            ∼50 s
FCN-8s        62.7            62.2            ∼175 ms

⁷ There are training images from [15] included in the PASCAL VOC 2011 val set, so we validate on the non-intersecting set of 736 images. An earlier version of this paper mistakenly evaluated on the entire val set.
⁸ Our models and code are publicly available at https://github.com/BVLC/caffe/wiki/ModelZoo#fcn.
⁹ This is the only metric provided by the test server.
Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [13]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [14], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.

Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth channels at the input. HHA is the depth embedding of [14] as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late fusion model that sums RGB and HHA predictions.

                     pixel acc.   mean acc.   mean IU   f.w. IU
Gupta et al. [14]    60.3         -           28.6      47.0
FCN-32s RGB          60.0         42.2        29.2      43.9
FCN-32s RGBD         61.5         42.4        30.5      45.5
FCN-32s HHA          57.1         35.2        24.2      40.4
FCN-32s RGB-HHA      64.3         44.9        32.8      48.0
FCN-16s RGB-HHA      65.4         46.1        34.0      49.5

SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images,¹⁰ show state-of-the-art performance on both tasks.

Table 5. Results on SIFT Flow¹⁰ with class segmentation (center) and geometric segmentation (right). Tighe [33] is a non-parametric transfer method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted RCNN3 (◦3). The metric for geometry is pixel accuracy.

                        pixel acc.   mean acc.   mean IU   f.w. IU   geom. acc.
Liu et al. [23]         76.7         -           -         -         -
Tighe et al. [33]       -            -           -         -         90.8
Tighe et al. [34] 1     75.6         41.1        -         -         -
Tighe et al. [34] 2     78.6         39.2        -         -         -
Farabet et al. [8] 1    72.3         50.8        -         -         -
Farabet et al. [8] 2    78.5         29.6        -         -         -
Pinheiro et al. [28]    77.7         29.8        -         -         -
FCN-16s                 85.2         51.7        39.5      76.1      94.3

¹⁰ Three of the SIFT Flow categories are not present in the test set. We made predictions across all 33 categories, but only included categories actually present in the test set in our evaluation. (An earlier version of this paper reported a lower mean IU, which included all categories either present or predicted in the evaluation.)

[Figure 6: example segmentations in four columns: FCN-8s, SDS [16], Ground Truth, Image.]
Figure 6. Fully convolutional segmentation nets produce state-of-the-art performance on PASCAL. The left column shows the output of our highest performing net, FCN-8s. The second shows the segmentations produced by the previous state-of-the-art system by Hariharan et al. [16]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fourth row shows a failure case: the net sees lifejackets in a boat as people.

6. Conclusion

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.

Acknowledgements. This work was supported in part
by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.

A. Upper Bounds on IU

In this paper, we have achieved good performance on the mean IU segmentation metric even with coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various scales. We do this by downsampling ground truth images and then upsampling them again to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.

factor    mean IU
128       50.9
64        73.3
32        86.1
16        92.8
8         96.4
4         98.5

Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is not a good measure of fine-scale accuracy.

B. More Results

We further evaluate our FCN for semantic segmentation.

PASCAL-Context [26] provides whole scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59 class task defined by [26] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 6, we compare to the joint object + stuff variation of Convolutional Feature Masking [3] which is the previous state-of-the-art on this task. FCN-8s scores 35.1 mean IU for an 11% relative improvement.

Table 6. Results on PASCAL-Context. CFM is the best result of [3] by convolutional feature masking and segment pursuit with the VGG net. O2P is the second order pooling method [1] as reported in the errata of [26]. The 59 class task includes the 59 most frequent classes while the 33 class task consists of an easier subset identified by [26].

59 class    pixel acc.   mean acc.   mean IU   f.w. IU
O2P         -            -           18.1      -
CFM         -            -           31.5      -
FCN-32s     63.8         42.7        31.8      48.3
FCN-16s     65.7         46.2        34.8      50.7
FCN-8s      65.9         46.5        35.1      51.0

33 class    pixel acc.   mean acc.   mean IU   f.w. IU
O2P         -            -           29.2      -
CFM         -            -           46.1      -
FCN-32s     69.8         65.1        50.4      54.9
FCN-16s     71.8         68.0        53.4      57.5
FCN-8s      71.8         67.6        53.5      57.7

Changelog

The arXiv version of this paper is kept up-to-date with corrections and additional relevant material. The following gives a brief history of changes.

v2  Add Appendix A giving upper bounds on mean IU and Appendix B with PASCAL-Context results. Correct PASCAL validation numbers (previously, some val images were included in train), SIFT Flow mean IU (which used an inappropriately strict metric), and an error in the frequency weighted mean IU formula. Add link to models and update timing numbers to reflect improved implementation (which is publicly available).

References

[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[2] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860, 2012.
[3] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv preprint arXiv:1412.1283, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 633–640. IEEE, 2013.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013.
[9] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. CoRR, abs/1405.5769, 2014.
[10] L. Florack, B. T. H. Romeny, M. Viergever, and J. Koenderink. The gaussian scale-space paradigm and the multiscale local jet. International Journal of Computer Vision, 18(1):61–75, 1996.
[11] Y. Ganin and V. Lempitsky. N4-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[13] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[14] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV. Springer, 2014.
[15] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Q. V. Le, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[21] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation, 1989.
[22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 1998.
[23] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978–994, 2011.
[24] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
[25] O. Matan, C. J. Burges, Y. LeCun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. In NIPS, pages 488–495. Citeseer, 1991.
[26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 891–898. IEEE, 2014.
[27] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9):1360–1371, 2005.
[28] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[30] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[33] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365. Springer, 2010.
[34] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.
[35] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. CoRR, abs/1406.2984, 2014.
[36] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[37] R. Wolf and J. C. Platt. Postal address block location using a convolutional locator network. Advances in Neural Information Processing Systems, pages 745–745, 1994.
[38] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[39] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In Computer Vision–ECCV 2014, pages 834–849. Springer, 2014.
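As an illustration of the four metrics defined under Metrics in Section 5, the following is a minimal sketch in plain Python. It assumes the confusion matrix is given as a nested list in which entry [i][j] counts pixels of class i predicted as class j, and that every class occurs in the ground truth. The function name and the dictionary keys are hypothetical and are not taken from the released code.

```python
def segmentation_metrics(n):
    """Compute pixel accuracy, mean accuracy, mean IU and frequency
    weighted IU from a confusion matrix n, where n[i][j] is the number
    of pixels of class i predicted to belong to class j."""
    n_cl = len(n)
    # t_i: total number of pixels whose true class is i (row sums).
    t = [sum(row) for row in n]
    # sum_j n_ji: total number of pixels predicted as class i (column sums).
    predicted = [sum(n[r][i] for r in range(n_cl)) for i in range(n_cl)]
    # Assumption of this sketch: every class appears in the ground truth.
    if any(t_i == 0 for t_i in t):
        raise ValueError("each class must occur at least once in the ground truth")
    total = sum(t)
    # Per-class IU: n_ii / (t_i + sum_j n_ji - n_ii).
    iu = [n[i][i] / (t[i] + predicted[i] - n[i][i]) for i in range(n_cl)]
    return {
        "pixel_acc": sum(n[i][i] for i in range(n_cl)) / total,
        "mean_acc": sum(n[i][i] / t[i] for i in range(n_cl)) / n_cl,
        "mean_iu": sum(iu) / n_cl,
        "fw_iu": sum(t[i] * iu[i] for i in range(n_cl)) / total,
    }


# Toy two-class example: 10 pixels, 8 of them classified correctly.
metrics = segmentation_metrics([[3, 1], [1, 5]])
```

For the toy matrix, pixel accuracy is 8/10, while mean IU averages 3/5 and 5/7, which shows how the IU-based metrics penalise errors more heavily than pixel accuracy does.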
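The Dense Prediction paragraph states that deconvolution filters are fixed or initialized to bilinear interpolation but does not give the weights. The sketch below shows one common way to construct such a kernel for a square filter. The construction, the function name, and the choice of a 4×4 kernel for 2× upsampling are assumptions made for illustration, not details confirmed by the text.

```python
def bilinear_kernel(size):
    """Return a size x size list of bilinear interpolation weights.

    Assumed construction: a separable triangle (tent) filter centred
    on the kernel, as is commonly used for upsampling layers.
    """
    factor = (size + 1) // 2
    # Centre of the kernel: a whole pixel for odd sizes, a half pixel for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    # 1-D tent weights that fall off linearly with distance from the centre.
    weights = [1 - abs(i - center) / factor for i in range(size)]
    # The 2-D kernel is the outer product of the 1-D weights.
    return [[wy * wx for wx in weights] for wy in weights]


# A 4x4 kernel, which would pair with a stride-2 (2x) upsampling layer.
kernel = bilinear_kernel(4)
```

With this construction the 4×4 kernel has 1-D weights 0.25, 0.75, 0.75, 0.25, and its entries sum to 4, so each output pixel at stride 2 receives a total weight of one.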
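The upper-bound procedure of Appendix A (downsample the ground truth, upsample it again, and score the result against the original) can be sketched as follows. Appendix A does not say which resampling method is used, so nearest-neighbour subsampling and pixel repetition are assumptions here, as are the function names and the toy label map. Classes absent from both maps are skipped when averaging, which is also an assumption of this sketch.

```python
def down_up(labels, factor):
    """Keep the top-left label of every factor x factor block, then
    repeat it over the block to restore the original size."""
    h, w = len(labels), len(labels[0])
    return [[labels[(y // factor) * factor][(x // factor) * factor]
             for x in range(w)] for y in range(h)]


def mean_iu(truth, pred, n_cl):
    """Mean IU between two label maps given as nested lists."""
    # Confusion matrix: n[i][j] counts pixels of class i predicted as j.
    n = [[0] * n_cl for _ in range(n_cl)]
    for row_t, row_p in zip(truth, pred):
        for i, j in zip(row_t, row_p):
            n[i][j] += 1
    ious = []
    for c in range(n_cl):
        t_c = sum(n[c])                          # true pixels of class c
        p_c = sum(n[r][c] for r in range(n_cl))  # pixels predicted as c
        union = t_c + p_c - n[c][c]
        if union > 0:  # skip classes absent from both maps (assumption)
            ious.append(n[c][c] / union)
    return sum(ious) / len(ious)


# Hypothetical 4x4 ground truth with two classes.
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 1, 1, 1],
         [0, 0, 0, 1]]
bound_at_2 = mean_iu(truth, down_up(truth, 2), 2)
```

On this toy map a factor of 2 changes only two boundary pixels, so the bound stays high, which mirrors the observation in Appendix A that coarse predictions can still score well on mean IU.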
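The loss-sampling experiment behind Figure 5 ignores each final layer cell with probability $1 - p$ and enlarges the batch by a factor $1/p$. The sketch below illustrates only that bookkeeping. The function names, the base batch of 20 images, and the 100 cells per image are hypothetical values chosen for the example.

```python
import random


def sample_loss_mask(height, width, p, rng):
    """Keep each final layer cell independently with probability p,
    i.e. ignore it with probability 1 - p."""
    return [[rng.random() < p for _ in range(width)] for _ in range(height)]


def images_per_batch(base_images, p):
    """Scale the number of images by 1/p to offset the ignored cells."""
    return round(base_images / p)


def expected_kept_cells(base_images, cells_per_image, p):
    """Expected number of cells contributing to the loss per batch."""
    return images_per_batch(base_images, p) * cells_per_image * p


rng = random.Random(0)                   # seeded for repeatability
mask = sample_loss_mask(10, 10, 0.5, rng)
batch = images_per_batch(20, 0.5)        # hypothetical base batch of 20 images
kept = expected_kept_cells(20, 100, 0.5)
```

Because the larger batch cancels the sampling rate, the expected number of contributing cells matches the unsampled case, but more images must be processed per batch. This is the reason given in the text for the slower wall-time convergence of sampling.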
