Accurate Image Super-Resolution Using Very Deep Convolutional Networks

We present a highly accurate single-image super-resolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [19]. We find that increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates ($10^4$ times higher than SRCNN [6]) enabled by adjustable gradient clipping. Our proposed method performs better than existing methods in accuracy, and visual improvements in our results are easily noticeable.

Jiwon Kim, Jung Kwon Lee and Kyoung Mu Lee
Department of ECE, ASRI, Seoul National University, Korea
arXiv:1511.04587v2 [cs.CV] 11 Nov 2016

[Figure 1: plot of PSNR (dB) against running time (s), from slow to fast, for VDSR (Ours), SRCNN, A+, SelfEx and RFL.]
Figure 1: Our VDSR improves PSNR for scale factor ×2 on dataset Set5 in comparison to the state-of-the-art methods (SRCNN uses the public slower implementation using CPU). VDSR outperforms SRCNN by a large margin (0.87 dB).

1. Introduction

We address the problem of generating a high-resolution (HR) image given a low-resolution (LR) image, commonly referred to as single image super-resolution (SISR) [12], [8], [9]. SISR is widely used in computer vision applications ranging from security and surveillance imaging to medical imaging, where more image details are required on demand.

Many SISR methods have been studied in the computer vision community. Early methods include interpolation, such as bicubic interpolation and Lanczos resampling [7], and more powerful methods utilizing statistical image priors [20, 13] or internal patch recurrence [9].

Currently, learning methods are widely used to model a mapping from LR to HR patches. Neighbor embedding [4, 15] methods interpolate the patch subspace. Sparse coding [25, 26, 21, 22] methods use a learned compact dictionary based on sparse signal representation. Lately, random forest [18] and convolutional neural network (CNN) [6] have also been used with large improvements in accuracy.

Among them, Dong et al. [6] have demonstrated that a CNN can be used to learn a mapping from LR to HR in an end-to-end manner. Their method, termed SRCNN, does not require any engineered features that are typically necessary in other methods [25, 26, 21, 22] and shows state-of-the-art performance.

While SRCNN successfully introduced a deep learning technique into the super-resolution (SR) problem, we find its limitations in three aspects: first, it relies on the context of small image regions; second, training converges too slowly; third, the network only works for a single scale.

In this work, we propose a new method to practically resolve the issues.

Context. We utilize contextual information spread over very large image regions. For a large scale factor, it is often the case that information contained in a small patch is not sufficient for detail recovery (ill-posed). Our very deep network, using a large receptive field, takes a large image context into account.

Convergence. We suggest a way to speed up the training: residual-learning CNN and extremely high learning rates. As the LR image and the HR image share the same information to a large extent, explicitly modelling the residual image, which is the difference between HR and LR images, is advantageous. We propose a network structure for efficient learning when input and output are highly correlated. Moreover, our initial learning rate is $10^4$ times higher than that of SRCNN [6]. This is enabled by residual-learning and gradient clipping.

Scale Factor. We propose a single-model SR approach. Scales are typically user-specified and can be arbitrary, including fractions. For example, one might need smooth zoom-in in an image viewer or resizing to a specific dimension. Training and storing many scale-dependent models in preparation for all possible scenarios is impractical. We find a single convolutional network is sufficient for multi-scale-factor super-resolution.

Contribution. In summary, in this work, we propose a highly accurate SR method based on a very deep convolutional network. Very deep networks converge too slowly if small learning rates are used. Boosting the convergence rate with high learning rates leads to exploding gradients, and we resolve the issue with residual-learning and gradient clipping. In addition, we extend our work to cope with the multi-scale SR problem in a single network. Our method is relatively accurate and fast in comparison to state-of-the-art methods, as illustrated in Figure 1.

2. Related Work

SRCNN is a representative state-of-the-art deep-learning-based SR approach. So, let us analyze and compare it with our proposed method.

2.1. Convolutional Network for Image Super-Resolution

Model. SRCNN consists of three layers: patch extraction/representation, non-linear mapping and reconstruction. Filters of spatial sizes 9 × 9, 1 × 1, and 5 × 5 were used, respectively.

In [6], Dong et al. attempted to prepare deeper models, but failed to observe superior performance after a week of training. In some cases, deeper models gave inferior performance. They conclude that deeper networks do not result in better performance (Figure 9 of [6]).

However, we argue that increasing depth significantly boosts performance. We successfully use 20 weight layers (3 × 3 for each layer). Our network is very deep (20 vs. 3 [6]) and the information used for reconstruction (receptive field) is much larger (41 × 41 vs. 13 × 13).

Training. For training, SRCNN directly models high-resolution images. A high-resolution image can be decomposed into low-frequency information (corresponding to the low-resolution image) and high-frequency information (residual image or image details). Input and output images share the same low-frequency information. This indicates that SRCNN serves two purposes: carrying the input to the end layer and reconstructing residuals. Carrying the input to the end is conceptually similar to what an auto-encoder does. Training time might be spent on learning this auto-encoder, so that the convergence rate of learning the other part (image details) is significantly decreased. In contrast, since our network models the residual images directly, we can have much faster convergence with even better accuracy.

Scale. As in most existing SR methods, SRCNN is trained for a single scale factor and is supposed to work only with the specified scale. Thus, if a new scale is on demand, a new model has to be trained. To cope with multiple-scale SR (possibly including fractional factors), we need to construct an individual single-scale SR system for each scale of interest.

However, preparing many individual machines for all possible scenarios to cope with multiple scales is inefficient and impractical. In this work, we design and train a single network to handle the multiple-scale SR problem efficiently. This turns out to work very well. Our single machine compares favorably to a single-scale expert for the given sub-task. For three scale factors (×2, ×3, ×4), we can reduce the number of parameters three-fold.

In addition to the aforementioned issues, there are some minor differences. Our output image has the same size as the input image because we pad zeros at every layer during training, whereas the output from SRCNN is smaller than the input. Finally, we simply use the same learning rate for all layers, while SRCNN uses different learning rates for different layers in order to achieve stable convergence.

3. Proposed Method

3.1. Proposed Network

For SR image reconstruction, we use a very deep convolutional network inspired by Simonyan and Zisserman [19]. The configuration is outlined in Figure 2. We use d layers, where layers except the first and the last are of the same type: 64 filters of size 3 × 3 × 64, where a filter operates on a 3 × 3 spatial region across 64 channels (feature maps). The first layer operates on the input image. The last layer, used for image reconstruction, consists of a single filter of size 3 × 3 × 64.

The network takes an interpolated low-resolution image (to the desired size) as input and predicts image details. Modelling image details is often used in super-resolution methods [21, 22, 15, 3] and we find that CNN-based methods can benefit from this domain-specific knowledge.

In this work, we demonstrate that explicitly modelling image details (residuals) has several advantages. These are further discussed later in Section 4.2.

One problem with using a very deep network to predict dense outputs is that the size of the feature map gets reduced every time convolution operations are applied. For example, when an input of size (n+1) × (n+1) is applied to a network
with receptive field size n × n, the output image is 2 × 2.

This is in accordance with other super-resolution methods, since many require surrounding pixels to infer center pixels correctly. This center-surround relation is useful since the surrounding region provides more constraints to this ill-posed problem (SR). For pixels near the image boundary, this relation cannot be exploited to the full extent and many SR methods crop the result image.

This methodology, however, is not valid if the required surround region is very big. After cropping, the final image is too small to be visually pleasing.

To resolve this issue, we pad zeros before convolutions to keep the sizes of all feature maps (including the output image) the same. It turns out that zero-padding works surprisingly well. For this reason, our method differs from most other methods in the sense that pixels near the image boundary are also correctly predicted.

Once image details are predicted, they are added back to the input ILR image to give the final image (HR). We use this structure for all experiments in our work.

[Figure 2: network diagram. The ILR input x passes through Conv.1, ReLu.1, ..., Conv.D-1, ReLu.D-1 and Conv.D (Residual) to produce the residual r, which is added to x to give the HR output y.]
Figure 2: Our Network Structure. We cascade a pair of layers (convolutional and nonlinear) repeatedly. An interpolated low-resolution (ILR) image goes through layers and transforms into a high-resolution (HR) image. The network predicts a residual image and the addition of ILR and the residual gives the desired output. We use 64 filters for each convolutional layer and some sample feature maps are drawn for visualization. Most features after applying rectified linear units (ReLU) are zero.

3.2. Training

We now describe the objective to minimize in order to find optimal parameters of our model. Let x denote an interpolated low-resolution image and y a high-resolution image. Given a training dataset $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, our goal is to learn a model f that predicts values $\hat{y} = f(x)$, where $\hat{y}$ is an estimate of the target HR image. We minimize the mean squared error $\frac{1}{2}\|y - f(x)\|^2$ averaged over the training set.

Residual-Learning. In SRCNN, the exact copy of the input has to go through all layers until it reaches the output layer. With many weight layers, this becomes an end-to-end relation requiring very long-term memory. For this reason, the vanishing/exploding gradients problem [2] can be critical. We can solve this problem simply with residual-learning.

As the input and output images are largely similar, we define a residual image $r = y - x$, where most values are likely to be zero or small. We want to predict this residual image. The loss function now becomes $\frac{1}{2}\|r - f(x)\|^2$, where $f(x)$ is the network prediction.

In networks, this is reflected in the loss layer as follows. Our loss layer takes three inputs: residual estimate, network input (ILR image) and ground truth HR image. The loss is computed as the Euclidean distance between the reconstructed image (the sum of network input and output) and ground truth.

Training is carried out by optimizing the regression objective using mini-batch gradient descent based on back-propagation (LeCun et al. [14]). We set the momentum parameter to 0.9. The training is regularized by weight decay (L2 penalty multiplied by 0.0001).

High Learning Rates for Very Deep Networks. Training deep models can fail to converge in a realistic limit of time. SRCNN [6] fails to show superior performance with
more than three weight layers. While there can be various reasons, one possibility is that they stopped their training procedure before networks converged. Their learning rate, $10^{-5}$, is too small for a network to converge within a week on a common GPU. Looking at Fig. 9 of [6], it is not easy to say their deeper networks have converged and their performances were saturated. While more training would eventually resolve the issue, increasing the depth to 20 does not seem practical with SRCNN.

It is a basic rule of thumb to make the learning rate high to boost training. But simply setting the learning rate high can also lead to vanishing/exploding gradients [2]. For this reason, we suggest an adjustable gradient clipping for maximal boost in speed while suppressing exploding gradients.

Adjustable Gradient Clipping. Gradient clipping is a technique that is often used in training recurrent neural networks [17]. But, to our knowledge, its usage is limited in training CNNs. While there exist many ways to limit gradients, one of the common strategies is to clip individual gradients to the predefined range $[-\theta, \theta]$.

With clipping, gradients are in a certain range. With stochastic gradient descent commonly used for training, the learning rate is multiplied to adjust the step size. If a high learning rate is used, it is likely that $\theta$ is tuned to be small to avoid exploding gradients in a high learning rate regime. But as the learning rate is annealed to get smaller, the effective gradient (gradient multiplied by learning rate) approaches zero and training can take exponentially many iterations to converge if the learning rate is decreased geometrically.

For maximal speed of convergence, we clip the gradients to $[-\frac{\theta}{\gamma}, \frac{\theta}{\gamma}]$, where $\gamma$ denotes the current learning rate. We find the adjustable gradient clipping makes our convergence procedure extremely fast. Our 20-layer network training is done within 4 hours, whereas 3-layer SRCNN takes several days to train.

Multi-Scale. While very deep models can boost performance, more parameters are now needed to define a network. Typically, one network is created for each scale factor. Considering that fractional scale factors are often used, we need an economical way to store and retrieve networks.

For this reason, we also train a multi-scale model. With this approach, parameters are shared across all predefined scale factors. Training a multi-scale model is straightforward. Training datasets for several specified scales are combined into one big dataset.

Data preparation is similar to SRCNN [5] with some differences. Input patch size is now equal to the size of the receptive field and images are divided into sub-images with no overlap. A mini-batch consists of 64 sub-images, where sub-images from different scales can be in the same batch.

We implement our model using the MatConvNet package [23].

Table 1: Performance table (PSNR) for residual and non-residual networks ('Set5' dataset, ×2). Residual networks rapidly approach their convergence within 10 epochs.

(a) Initial learning rate 0.1
Epoch          10     20     40     80
Residual       36.90  36.64  37.12  37.05
Non-Residual   27.42  19.59  31.38  35.66
Difference     9.48   17.05  5.74   1.39

(b) Initial learning rate 0.01
Epoch          10     20     40     80
Residual       36.74  36.87  36.91  36.93
Non-Residual   30.33  33.59  36.26  36.42
Difference     6.41   3.28   0.65   0.52

(c) Initial learning rate 0.001
Epoch          10     20     40     80
Residual       36.31  36.46  36.52  36.52
Non-Residual   33.97  35.08  36.11  36.11
Difference     2.35   1.38   0.42   0.40

4. Understanding Properties

In this section, we study three properties of our proposed method. First, we show that large depth is necessary for the task of SR. A very deep network utilizes more contextual information in an image and models complex functions with many nonlinear layers. We experimentally verify that deeper networks give better performance than shallow ones.

Second, we show that our residual-learning network converges much faster than the standard CNN. Moreover, our network gives a significant boost in performance.

Third, we show that our method with a single network performs as well as a method using multiple networks trained for each scale. We can effectively reduce model capacity (the number of parameters) of multi-network approaches.

4.1. The Deeper, the Better

Convolutional neural networks exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers [1]. In other words, hidden units in layer m take as input a subset of units in layer m−1. They form spatially contiguous receptive fields.

Each hidden unit is unresponsive to variations outside of the receptive field with respect to the input. The architecture thus ensures that the learned filters produce the strongest response to a spatially local input pattern.

However, stacking many such layers leads to filters that become increasingly global (i.e. responsive to a larger region of pixel space). In other words, a filter of very large support can be effectively decomposed into a series of small filters.
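The decomposition of a large filter into a series of small ones can be checked numerically in the purely linear case. The following is a minimal sketch rather than the paper's implementation: it assumes NumPy, stride-1 "valid" cross-correlation with no nonlinearity between the two layers, and the helper names `correlate2d_valid` and `compose_kernels` are hypothetical. It shows that two stacked 3 × 3 filters act like one 5 × 5 filter, so each additional 3 × 3 layer widens the effective support by 2 pixels.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Stride-1 'valid' cross-correlation, the operation CNN layers call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def compose_kernels(first, second):
    """Single kernel equivalent to applying `first`, then `second` (linear case only)."""
    fh, fw = first.shape
    sh, sw = second.shape
    combined = np.zeros((fh + sh - 1, fw + sw - 1))
    for a in range(sh):
        for b in range(sw):
            # Each tap of `second` contributes a shifted, scaled copy of `first`.
            combined[a:a + fh, b:b + fw] += second[a, b] * first
    return combined


rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))   # arbitrary test input (assumption)
k1 = rng.standard_normal((3, 3))        # first small filter
k2 = rng.standard_normal((3, 3))        # second small filter

stacked = correlate2d_valid(correlate2d_valid(image, k1), k2)
single = correlate2d_valid(image, compose_kernels(k1, k2))

print(compose_kernels(k1, k2).shape)    # (5, 5)
print(np.allclose(stacked, single))     # True
```

With ReLU layers in between, the exact equivalence to one linear filter no longer holds, but the region of input pixels that can influence an output pixel still grows in the same way.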

In this work, we use filters of the same size, 3 × 3, for all layers. For the first layer, the receptive field is of size 3 × 3. For the next layers, the size of the receptive field increases by 2 in both height and width. For a depth D network, the receptive field has size (2D + 1) × (2D + 1). Its size grows linearly with the depth.

In the task of SR, this corresponds to the amount of contextual information that can be exploited to infer high-frequency components. A large receptive field means the network can use more context to predict image details. As SR is an ill-posed inverse problem, collecting and analyzing more neighbor pixels gives more clues. For example, if there are some image patterns entirely contained in a receptive field, it is plausible that this pattern is recognized and used to super-resolve the image.

In addition, very deep networks can exploit high nonlinearities. We use 19 rectified linear units and our networks can model very complex functions with a moderate number of channels (neurons). The advantages of making a thin deep network are well explained in Simonyan and Zisserman [19].

We now experimentally show that very deep networks significantly improve SR performance. We train and test networks of depth ranging from 5 to 20 (only counting weight layers, excluding nonlinearity layers). In Figure 3, we show the results. In most cases, performance improves rapidly as depth increases.

[Figure 3: three plots of PSNR (dB) against depth (5 to 20). Panels: (a) Test Scale Factor 2, (b) Test Scale Factor 3, (c) Test Scale Factor 4.]
Figure 3: Depth vs Performance

[Figure 4: three plots of PSNR (dB) against epochs (0 to 80) for Residual, Non-Residual and Bicubic. Panels: (a) Initial learning rate 0.1, (b) Initial learning rate 0.01, (c) Initial learning rate 0.001.]
Figure 4: Performance curve for residual and non-residual networks. Two networks are tested under 'Set5' dataset with scale factor 2. Residual networks quickly reach state-of-the-art performance within a few epochs, whereas non-residual networks (which model the high-resolution image directly) take many epochs to reach maximum performance. Moreover, the final accuracy is higher for residual networks.

4.2. Residual-Learning

As we already have a low-resolution image as the input, predicting high-frequency components is enough for the purpose of SR. Although the concept of predicting residuals has been used in previous methods [21, 22, 26], it has not been studied in the context of a deep-learning-based SR framework.

In this work, we have proposed a network structure that learns residual images. We now study the effect of this modification to a standard CNN structure in detail.

First, we find that this residual network converges much faster. Two networks are compared experimentally: the
residual network and the standard non-residual network. We use depth 10 (weight layers) and scale factor 2. Performance curves for various learning rates are shown in Figure 4. All use the same learning rate scheduling mechanism that has been mentioned above.

Second, at convergence, the residual network shows superior performance. In Figure 4, residual networks give higher PSNR when training is done.

Another remark is that if small learning rates are used, networks do not converge in the given number of epochs. If initial learning rate 0.1 is used, the PSNR of a residual-learning network reaches 36.90 within 10 epochs. But if 0.001 is used instead, the network never reaches the same level of performance (its performance is 36.52 after 80 epochs). In a similar manner, residual and non-residual networks show dramatic performance gaps after 10 epochs (36.90 vs. 27.42 for rate 0.1).

In short, this simple modification to a standard non-residual network structure is very powerful and one can explore the validity of the idea in other image restoration problems where input and output images are highly correlated.

4.3. Single Model for Multiple Scales

Scale augmentation during training is a key technique to equip a network with super-resolution machines of multiple scales. Many SR processes for different scales can be executed with our multi-scale machine with much smaller capacity than that of single-scale machines combined.

We start with an interesting experiment as follows: we train our network with a single scale factor $s_{train}$ and it is tested under another scale factor $s_{test}$. Here, factors 2, 3 and 4, which are widely used in SR comparisons, are considered. Possible pairs ($s_{train}$, $s_{test}$) are tried for the dataset 'Set5' [15]. Experimental results are summarized in Table 2.

Table 2: Scale Factor Experiment. Several models are trained with different scale sets. Quantitative evaluation (PSNR) on dataset 'Set5' is provided for scale factors 2, 3 and 4. An asterisk (red in the original) indicates that the test scale is included during training. Models trained with multiple scales perform well on the trained scales.

Test / Train   ×2      ×3      ×4      ×2,3    ×2,4    ×3,4    ×2,3,4  Bicubic
×2             37.10*  30.05   28.13   37.09*  37.03*  32.43   37.06*  33.66
×3             30.42   32.89*  30.50   33.22*  31.20   33.24*  33.27*  30.39
×4             28.43   28.73   30.84*  28.70   30.86*  30.94*  30.95*  28.42

Performance is degraded if $s_{train} \neq s_{test}$. For scale factor 2, the model trained with factor 2 gives a PSNR of 37.10 (in dB), whereas models trained with factors 3 and 4 give 30.05 and 28.13, respectively. A network trained over single-scale data is not capable of handling other scales. In many tests, it is even worse than bicubic interpolation, the method used for generating the input image.

We now test if a model trained with scale augmentation is capable of performing SR at multiple scale factors. The same network used above is trained with multiple scale factors $s_{train} = \{2, 3, 4\}$. In addition, we experiment with the cases $s_{train} = \{2, 3\}, \{2, 4\}, \{3, 4\}$ for more comparisons.

We observe that the network copes with any scale used during training. When $s_{train} = \{2, 3, 4\}$ (×2,3,4 in Table 2), its PSNR for each scale is comparable to that achieved by the corresponding single-scale network: 37.06 vs. 37.10 (×2), 33.27 vs. 32.89 (×3), 30.95 vs. 30.84 (×4).

Another pattern is that for large scales (×3, ×4), our multi-scale network outperforms the single-scale network: our models (×2,3), (×3,4) and (×2,3,4) give PSNRs of 33.22, 33.24 and 33.27 for test scale 3, respectively, whereas (×3) gives 32.89. Similarly, (×2,4), (×3,4) and (×2,3,4) give 30.86, 30.94 and 30.95 (vs. 30.84 by the ×4 model), respectively. From this, we observe that training multiple scales boosts the performance for large scales.

[Figure 5: two rows of result images at scale factors ×1.5, ×2, ×2.5, ×3, ×3.5 and ×4.]
Figure 5: (Top) Our results using a single network for all scale factors. Super-resolved images over all scales are clean and sharp. (Bottom) Results of Dong et al. [5] (×3 model used for all scales). Result images are not visually pleasing. To handle multiple scales, existing methods require multiple networks.

5. Experimental Results

In this section, we evaluate the performance of our method on several datasets. We first describe datasets used for training and testing our method. Next, parameters necessary for training are given.

After outlining our experimental setup, we compare our method with several state-of-the-art SISR methods.

5.1. Datasets for Training and Testing

Training dataset. Different learning-based methods use different training images. For example, RFL [18] has two methods, where the first one uses 91 images from Yang et al. [25] and the second one uses 291 images with the addition of 200 images from the Berkeley Segmentation Dataset [16]. SRCNN [6] uses a very large ImageNet dataset.

We use 291 images as in [18] for benchmark with other methods in this section. In addition, data augmentation (rotation or flip) is used. For results in previous sections, we used 91 images to train the network fast, so performances can be slightly different.

Test dataset. For benchmark, we use four datasets. Datasets 'Set5' [15] and 'Set14' [26] are often used for benchmark in other works [22, 21, 5]. Dataset 'Urban100', a dataset of urban images recently provided by Huang et al. [11], is very interesting as it contains many challenging images on which many of the existing methods fail. Finally, dataset 'B100', natural images in the Berkeley Segmentation Dataset used in Timofte et al. [22] and Yang and Yang [24] for benchmark, is also employed.

5.2. Training Parameters

We provide parameters used to train our final model. We use a network of depth 20. Training uses batches of size 64. Momentum and weight decay parameters are set to 0.9 and 0.0001, respectively.

For weight initialization, we use the method described in He et al. [10]. This is a theoretically sound procedure for networks utilizing rectified linear units (ReLU).

We train all experiments over 80 epochs (9960 iterations with batch size 64). The learning rate was initially set to 0.1 and then decreased by a factor of 10 every 20 epochs. In total, the learning rate was decreased 3 times, and training is stopped after 80 epochs. Training takes roughly 4 hours on a GPU Titan Z.

5.3. Benchmark

For benchmark, we follow the publicly available framework of Huang et al. [11]. It enables the comparison of many state-of-the-art results with the same evaluation procedure.

The framework applies bicubic interpolation to color components of an image and sophisticated models to luminance components, as in other methods [4], [9], [26]. This is because human vision is more sensitive to details in intensity than in color.

This framework crops pixels near the image boundary. For our method, this procedure is unnecessary as our network outputs the full-sized image. For fair comparison, however, we also crop pixels to the same amount.

5.4. Comparisons with State-of-the-Art Methods

We provide quantitative and qualitative comparisons. Compared methods are A+ [22], RFL [18], SelfEx [11] and SRCNN [5]. In Table 3, we provide a summary of quantitative evaluation on several datasets. Our method outperforms all previous methods on these datasets. Moreover, our method is relatively fast. The public code of SRCNN, based on a CPU implementation, is slower than the code used by Dong et al. [6] in their paper, which is based on a GPU implementation.

In Figures 6 and 7, we compare our method with top-performing methods. In Figure 6, only our method perfectly reconstructs the line in the middle. Similarly, in Figure 7, contours are clean and vivid in our method whereas they are severely blurred or distorted in other methods.

[Figure 6: image comparison. Panels: Ground Truth, A+ [22], RFL [18], SelfEx [11], SRCNN [5], VDSR (Ours).]
(PSNR, SSIM): A+ [22] (22.92, 0.7379), RFL [18] (22.90, 0.7332), SelfEx [11] (23.00, 0.7439), SRCNN [5] (23.15, 0.7487), VDSR (Ours) (23.50, 0.7777)
Figure 6: Super-resolution results of "148026" (B100) with scale factor ×3. VDSR recovers sharp lines.

[Figure 7: image comparison. Panels: Ground Truth, A+ [22], RFL [18], SelfEx [11], SRCNN [5], VDSR (Ours).]
(PSNR, SSIM): A+ [22] (27.08, 0.7514), RFL [18] (27.08, 0.7508), SelfEx [11] (27.02, 0.7513), SRCNN [5] (27.16, 0.7545), VDSR (Ours) (27.32, 0.7606)
Figure 7: Super-resolution results of "38092" (B100) with scale factor ×3. The horn in the image is sharp in the result of VDSR.

Table 3: Average PSNR/SSIM/time for scale factors ×2, ×3 and ×4 on datasets Set5, Set14, B100 and Urban100. In the original, red color indicates the best performance and blue color indicates the second best performance.

Dataset   Scale  Bicubic              A+ [22]              RFL [18]             SelfEx [11]          SRCNN [5]            VDSR (Ours)
Set5      ×2     33.66/0.9299/0.00    36.54/0.9544/0.58    36.54/0.9537/0.63    36.49/0.9537/45.78   36.66/0.9542/2.19    37.53/0.9587/0.13
Set5      ×3     30.39/0.8682/0.00    32.58/0.9088/0.32    32.43/0.9057/0.49    32.58/0.9093/33.44   32.75/0.9090/2.23    33.66/0.9213/0.13
Set5      ×4     28.42/0.8104/0.00    30.28/0.8603/0.24    30.14/0.8548/0.38    30.31/0.8619/29.18   30.48/0.8628/2.19    31.35/0.8838/0.12
Set14     ×2     30.24/0.8688/0.00    32.28/0.9056/0.86    32.26/0.9040/1.13    32.22/0.9034/105.00  32.42/0.9063/4.32    33.03/0.9124/0.25
Set14     ×3     27.55/0.7742/0.00    29.13/0.8188/0.56    29.05/0.8164/0.85    29.16/0.8196/74.69   29.28/0.8209/4.40    29.77/0.8314/0.26
Set14     ×4     26.00/0.7027/0.00    27.32/0.7491/0.38    27.24/0.7451/0.65    27.40/0.7518/65.08   27.49/0.7503/4.39    28.01/0.7674/0.25
B100      ×2     29.56/0.8431/0.00    31.21/0.8863/0.59    31.16/0.8840/0.80    31.18/0.8855/60.09   31.36/0.8879/2.51    31.90/0.8960/0.16
B100      ×3     27.21/0.7385/0.00    28.29/0.7835/0.33    28.22/0.7806/0.62    28.29/0.7840/40.01   28.41/0.7863/2.58    28.82/0.7976/0.21
B100      ×4     25.96/0.6675/0.00    26.82/0.7087/0.26    26.75/0.7054/0.48    26.84/0.7106/35.87   26.90/0.7101/2.51    27.29/0.7251/0.21
Urban100  ×2     26.88/0.8403/0.00    29.20/0.8938/2.96    29.11/0.8904/3.62    29.54/0.8967/663.98  29.50/0.8946/22.12   30.76/0.9140/0.98
Urban100  ×3     24.46/0.7349/0.00    26.03/0.7973/1.67    25.86/0.7900/2.48    26.44/0.8088/473.60  26.24/0.7989/19.35   27.14/0.8279/1.08
Urban100  ×4     23.14/0.6577/0.00    24.32/0.7183/1.21    24.19/0.7096/1.88    24.79/0.7374/394.40  24.52/0.7221/18.46   25.18/0.7524/1.06

6. Conclusion

In this work, we have presented a super-resolution method using very deep networks. Training a very deep network is hard due to a slow convergence rate. We use residual-learning and extremely high learning rates to optimize a very deep network fast. Convergence speed is maximized and we use gradient clipping to ensure the training stability. We have demonstrated that our method outperforms the existing methods by a large margin on benchmarked images. We believe our approach is readily applicable to other image restoration problems such as denoising and compression artifact removal.

References

[1] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep learning. Book in preparation for MIT Press, 2015.
[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
[3] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Morel. Super-resolution using neighbor embedding of back-projection residuals. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–8. IEEE, 2013.
[4] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
[5] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[6] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.
[7] C. E. Duchon. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology, 18(8):1016–1022, 1979.
[8] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
[9] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In ICCV, 2009.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
[11] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution using transformed self-exemplars. In CVPR, 2015.
[12] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.
[13] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. TPAMI, 2010.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
[16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[17] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[18] S. Schulter, C. Leistner, and H. Bischof. Fast and accurate image upscaling with super-resolution forests. In CVPR, 2015.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[20] J. Sun, Z. Xu, and H.-Y. Shum. Image super-resolution using gradient profile prior. In CVPR, 2008.
[21] R. Timofte, V. De Smet, and L. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, 2013.
[22] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, 2014.
[23] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[24] C.-Y. Yang and M.-H. Yang. Fast direct super-resolution by simple functions. In ICCV, 2013.
[25] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. TIP, 2010.
[26] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730. Springer, 2012.
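
As a rough illustration of the training parameters reported in Section 5.2, the step learning-rate schedule (initial rate 0.1, divided by 10 every 20 epochs, 80 epochs in total) can be sketched in Python as follows. This is only a sketch under stated assumptions, not the authors' training code: the names `TRAINING_CONFIG`, `step_learning_rate` and `count_rate_drops` are hypothetical, and we assume epochs are counted from 0 so that the rate changes at the start of epochs 20, 40 and 60.

```python
import math

# Values quoted in Section 5.2; the dictionary layout itself is hypothetical.
TRAINING_CONFIG = {
    "depth": 20,
    "batch_size": 64,
    "momentum": 0.9,
    "weight_decay": 0.0001,
    "initial_lr": 0.1,
    "lr_drop_factor": 10.0,
    "epochs_per_drop": 20,
    "total_epochs": 80,
}


def step_learning_rate(epoch, config=TRAINING_CONFIG):
    """Return the learning rate for a 0-indexed epoch under step decay.

    Assumption (not stated in the text): epochs are counted from 0 and the
    rate is divided by the drop factor at the start of epochs 20, 40 and 60.
    """
    if not 0 <= epoch < config["total_epochs"]:
        raise ValueError("epoch must lie in [0, total_epochs)")
    drops = epoch // config["epochs_per_drop"]
    return config["initial_lr"] / (config["lr_drop_factor"] ** drops)


def count_rate_drops(config=TRAINING_CONFIG):
    """Count how often the rate changes between consecutive epochs."""
    rates = [step_learning_rate(e, config) for e in range(config["total_epochs"])]
    return sum(
        1 for prev, cur in zip(rates, rates[1:]) if not math.isclose(prev, cur)
    )


if __name__ == "__main__":
    # Print the rate at the boundaries of each 20-epoch stage.
    for epoch in (0, 19, 20, 40, 60, 79):
        print(f"epoch {epoch:2d}: lr = {step_learning_rate(epoch):.4f}")
```

Under these assumptions, `count_rate_drops()` evaluates to 3, since an 80-epoch run with a drop every 20 epochs changes the rate at three epoch boundaries. Momentum, weight decay and batch size are listed in the configuration only for reference and would be passed to whatever optimizer is actually used.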