# Pixel Recurrent Neural Networks


Aäron van den Oord (avdnoord@google.com), Nal Kalchbrenner (nalk@google.com), Koray Kavukcuoglu (korayk@google.com)

Google DeepMind

arXiv:1601.06759v3 [cs.CV] 19 Aug 2016. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

## Abstract

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

Figure 1. Image completions sampled from a PixelRNN. (Panels: occluded, completions, original.)

## 1. Introduction

Generative image modeling is a central problem in unsupervised learning. Probabilistic density models can be used for a wide variety of tasks that range from image compression and forms of reconstruction such as image inpainting (e.g., see Figure 1) and deblurring, to generation of new images. When the model is conditioned on external information, possible applications also include creating images based on text descriptions or simulating future frames in a planning task. One of the great advantages in generative modeling is that there are practically endless amounts of image data available to learn from. However, because images are high dimensional and highly structured, estimating the distribution of natural images is extremely challenging.

One of the most important obstacles in generative modeling is building complex and expressive models that are also tractable and scalable. This trade-off has resulted in a large variety of generative models, each with its own advantages. Most work focuses on stochastic latent variable models such as VAEs (Rezende et al., 2014; Kingma & Welling, 2013) that aim to extract meaningful representations, but often come with an intractable inference step that can hinder their performance.

One effective approach to tractably model a joint distribution of the pixels in the image is to cast it as a product of conditional distributions; this approach has been adopted in autoregressive models such as NADE (Larochelle & Murray, 2011) and fully visible neural networks (Neal, 1992; Bengio & Bengio, 2000). The factorization turns the joint modeling problem into a sequence problem, where one learns to predict the next pixel given all the previously generated pixels. But to model the highly nonlinear and long-range correlations between pixels and the complex conditional distributions that result, a highly expressive sequence model is necessary.

Recurrent Neural Networks (RNN) are powerful models that offer a compact, shared parametrization of a series of conditional distributions. RNNs have been shown to excel at hard sequence problems ranging from handwriting generation (Graves, 2013), to character prediction (Sutskever et al., 2011) and to machine translation (Kalchbrenner & Blunsom, 2013). A two-dimensional RNN has produced very promising results in modeling grayscale images and textures (Theis & Bethge, 2015).

In this paper we advance two-dimensional RNNs and apply them to large-scale modeling of natural images. The resulting PixelRNNs are composed of up to twelve fast two-dimensional Long Short-Term Memory (LSTM) layers. These layers use LSTM units in their state (Hochreiter & Schmidhuber, 1997; Graves & Schmidhuber, 2009) and adopt a convolution to compute at once all the states along one of the spatial dimensions of the data. We design two types of these layers. The first type is the Row LSTM layer where the convolution is applied along each row; a similar technique is described in (Stollenga et al., 2015). The second type is the Diagonal BiLSTM layer where the convolution is applied in a novel fashion along the diagonals of the image. The networks also incorporate residual connections (He et al., 2015) around LSTM layers; we observe that this helps with training of the PixelRNN for up to twelve layers of depth.

We also consider a second, simplified architecture which shares the same core components as the PixelRNN. We observe that Convolutional Neural Networks (CNN) can also be used as a sequence model with a fixed dependency range, by using masked convolutions. The PixelCNN architecture is a fully convolutional network of fifteen layers that preserves the spatial resolution of its input throughout the layers and outputs a conditional distribution at each location.

Both PixelRNN and PixelCNN capture the full generality of pixel inter-dependencies without introducing independence assumptions as in, e.g., latent variable models. The dependencies are also maintained between the RGB color values within each individual pixel. Furthermore, in contrast to previous approaches that model the pixels as continuous values (e.g., Theis & Bethge (2015); Gregor et al. (2014)), we model the pixels as discrete values using a multinomial distribution implemented with a simple softmax layer. We observe that this approach gives both representational and training advantages for our models.

The contributions of the paper are as follows. In Section 3 we design two types of PixelRNNs corresponding to the two types of LSTM layers; we describe the purely convolutional PixelCNN that is our fastest architecture; and we design a Multi-Scale version of the PixelRNN. In Section 5 we show the relative benefits of using the discrete softmax distribution in our models and of adopting residual connections for the LSTM layers. Next we test the models on MNIST and on CIFAR-10 and show that they obtain log-likelihood scores that are considerably better than previous results. We also provide results for the large-scale ImageNet dataset resized to both 32 × 32 and 64 × 64 pixels; to our knowledge likelihood values from generative models have not previously been reported on this dataset. Finally, we give a qualitative evaluation of the samples generated from the PixelRNNs.

Figure 2. Left: To generate pixel $x_i$ one conditions on all the previously generated pixels left and above of $x_i$. Center: To generate a pixel in the multi-scale case we can also condition on the subsampled image pixels (in light blue). Right: Diagram of the connectivity inside a masked convolution. In the first layer, each of the RGB channels is connected to previous channels and to the context, but is not connected to itself. In subsequent layers, the channels are also connected to themselves.

## 2. Model

Our aim is to estimate a distribution over natural images that can be used to tractably compute the likelihood of images and to generate new ones. The network scans the image one row at a time and one pixel at a time within each row. For each pixel it predicts the conditional distribution over the possible pixel values given the scanned context. Figure 2 illustrates this process. The joint distribution over the image pixels is factorized into a product of conditional distributions. The parameters used in the predictions are shared across all pixel positions in the image.

To capture the generation process, Theis & Bethge (2015) propose to use a two-dimensional LSTM network (Graves & Schmidhuber, 2009) that starts at the top left pixel and proceeds towards the bottom right pixel. The advantage of the LSTM network is that it effectively handles long-range dependencies that are central to object and scene understanding. The two-dimensional structure ensures that the signals are well propagated both in the left-to-right and top-to-bottom directions.

In this section we first focus on the form of the distribution, whereas the next section will be devoted to describing the architectural innovations inside PixelRNN.

### 2.1. Generating an Image Pixel by Pixel

The goal is to assign a probability $p(\mathbf{x})$ to each image $\mathbf{x}$ formed of $n \times n$ pixels. We can write the image $\mathbf{x}$ as a one-dimensional sequence $x_1, \ldots, x_{n^2}$ where pixels are taken from the image row by row. To estimate the joint distribution $p(\mathbf{x})$ we write it as the product of the conditional distributions over the pixels:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}) \tag{1}$$
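Equation 1 implies that $\log p(\mathbf{x})$ is a sum of conditional log-probabilities, one per dimension in generation order. The following is a minimal NumPy sketch of that bookkeeping, not the authors' implementation. It assumes a hypothetical array `logits` of unnormalized scores over 256 discrete values per dimension (the discrete parametrization the paper adopts), already produced by some model that respects the ordering. The function names are illustrative, and the bits-per-dimension normalization follows the convention described in Section 5.1.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def image_log_likelihood(logits, pixels):
    """Sum of conditional log-probabilities (Equation 1), in nats.

    logits: (D, 256) array, row d scores the value of dimension d
            given all earlier dimensions (assumed precomputed).
    pixels: (D,) integer array with the observed values in [0, 255].
    """
    logp = log_softmax(logits)
    return logp[np.arange(pixels.size), pixels].sum()

def bits_per_dim(log_likelihood_nats, num_dims):
    """Negative log-likelihood normalized per dimension, in bits."""
    return -log_likelihood_nats / (num_dims * np.log(2.0))

# Example: uniform conditionals on a 32 x 32 x 3 image.
rng = np.random.default_rng(0)
D = 32 * 32 * 3
pixels = rng.integers(0, 256, size=D)
ll = image_log_likelihood(np.zeros((D, 256)), pixels)
print(bits_per_dim(ll, D))
```

With uniform conditionals every dimension costs $\log_2 256 = 8$ bits, which is the baseline a trained model has to beat.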

The value $p(x_i \mid x_1, \ldots, x_{i-1})$ is the probability of the $i$-th pixel $x_i$ given all the previous pixels $x_1, \ldots, x_{i-1}$. The generation proceeds row by row and pixel by pixel. Figure 2 (Left) illustrates the conditioning scheme.

Each pixel $x_i$ is in turn jointly determined by three values, one for each of the color channels Red, Green and Blue (RGB). We rewrite the distribution $p(x_i \mid \mathbf{x}_{<i})$ as the following product:

$$p(x_{i,R} \mid \mathbf{x}_{<i})\; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R})\; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G}) \tag{2}$$

Each of the colors is thus conditioned on the other channels as well as on all the previously generated pixels.

Note that during training and evaluation the distributions over the pixel values are computed in parallel, while the generation of an image is sequential.

Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 × 1.

### 2.2. Pixels as Discrete Variables

Previous approaches use a continuous distribution for the values of the pixels in the image (e.g. Theis & Bethge (2015); Uria et al. (2014)). By contrast we model $p(\mathbf{x})$ as a discrete distribution, with every conditional distribution in Equation 2 being a multinomial that is modeled with a softmax layer. Each channel variable $x_{i,*}$ simply takes one of 256 distinct values. The discrete distribution is representationally simple and has the advantage of being arbitrarily multimodal without prior on the shape (see Fig. 6). Experimentally we also find the discrete distribution to be easy to learn and to produce better performance compared to a continuous distribution (Section 5).

## 3. Pixel Recurrent Neural Networks

In this section we describe the architectural components that compose the PixelRNN. In Sections 3.1 and 3.2, we describe the two types of LSTM layers that use convolutions to compute at once the states along one of the spatial dimensions. In Section 3.3 we describe how to incorporate residual connections to improve the training of a PixelRNN with many LSTM layers. In Section 3.4 we describe the softmax layer that computes the discrete joint distribution of the colors and the masking technique that ensures the proper conditioning scheme. In Section 3.5 we describe the PixelCNN architecture. Finally in Section 3.6 we describe the multi-scale architecture.

### 3.1. Row LSTM

The Row LSTM is a unidirectional layer that processes the image row by row from top to bottom computing features for a whole row at once; the computation is performed with a one-dimensional convolution. For a pixel $x_i$ the layer captures a roughly triangular context above the pixel as shown in Figure 4 (center). The kernel of the one-dimensional convolution has size $k \times 1$ where $k \geq 3$; the larger the value of $k$ the broader the context that is captured. The weight sharing in the convolution ensures translation invariance of the computed features along each row.

The computation proceeds as follows. An LSTM layer has an input-to-state component and a recurrent state-to-state component that together determine the four gates inside the LSTM core. To enhance parallelization in the Row LSTM the input-to-state component is first computed for the entire two-dimensional input map; for this a $k \times 1$ convolution is used to follow the row-wise orientation of the LSTM itself. The convolution is masked to include only the valid context (see Section 3.4) and produces a tensor of size $4h \times n \times n$, representing the four gate vectors for each position in the input map, where $h$ is the number of output feature maps.

To compute one step of the state-to-state component of the LSTM layer, one is given the previous hidden and cell states $\mathbf{h}_{i-1}$ and $\mathbf{c}_{i-1}$, each of size $h \times n \times 1$. The new hidden and cell states $\mathbf{h}_i$, $\mathbf{c}_i$ are obtained as follows:

$$
\begin{aligned}
[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] &= \sigma(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i) \\
\mathbf{c}_i &= \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i \\
\mathbf{h}_i &= \mathbf{o}_i \odot \tanh(\mathbf{c}_i)
\end{aligned}
\tag{3}
$$

where $\mathbf{x}_i$ of size $h \times n \times 1$ is row $i$ of the input map, $\circledast$ represents the convolution operation and $\odot$ the element-wise multiplication. The weights $\mathbf{K}^{ss}$ and $\mathbf{K}^{is}$ are the kernel weights for the state-to-state and the input-to-state components, where the latter is precomputed as described above. In the case of the output, forget and input gates $\mathbf{o}_i$, $\mathbf{f}_i$ and $\mathbf{i}_i$, the activation $\sigma$ is the logistic sigmoid function, whereas for the content gate $\mathbf{g}_i$, $\sigma$ is the tanh function. Each step computes at once the new state for an entire row of the input map. Because the Row LSTM has a triangular receptive field (Figure 4), it is unable to capture the entire available context.
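One step of the Row LSTM recurrence in Equation 3 can be sketched in NumPy as follows. This is an illustrative reading of the equation, not the authors' code. It assumes the input-to-state term for the current row has already been computed (the hypothetical `is_row` array of shape `(4h, n)`), it drops the trailing singleton dimension of the states, and it uses a zero-padded one-dimensional convolution along the row. The names `conv1d_same` and `row_lstm_step` are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (cross-correlation form) along a row.

    x:      (c_in, n) feature map for one row.
    kernel: (c_out, c_in, k) weights with odd k.
    Returns a (c_out, n) array.
    """
    c_out, c_in, k = kernel.shape
    pad = k // 2
    n = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, n))
    for t in range(k):
        # Mix channels for kernel tap t and accumulate.
        out += kernel[:, :, t] @ xp[:, t:t + n]
    return out

def row_lstm_step(h_prev, c_prev, is_row, K_ss):
    """One state-to-state step of Equation 3 for an entire row.

    h_prev, c_prev: (h, n) hidden and cell states of the previous row.
    is_row:         (4h, n) precomputed input-to-state contribution.
    K_ss:           (4h, h, k) state-to-state kernel.
    """
    gates = conv1d_same(h_prev, K_ss) + is_row
    o, f, i, g = np.split(gates, 4, axis=0)  # gate order as in Equation 3
    o, f, i = sigmoid(o), sigmoid(f), sigmoid(i)
    g = np.tanh(g)  # content gate uses tanh
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

Iterating this step from the top row to the bottom row yields the whole layer; each call updates all `n` positions of a row at once, which is the parallelism the section describes.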

Figure 4. Visualization of the input-to-state and state-to-state mappings for the three proposed architectures. (Panels: PixelCNN, Row LSTM, Diagonal BiLSTM.)

### 3.2. Diagonal BiLSTM

The Diagonal BiLSTM is designed to both parallelize the computation and to capture the entire available context for any image size. Each of the two directions of the layer scans the image in a diagonal fashion starting from a corner at the top and reaching the opposite corner at the bottom. Each step in the computation computes at once the LSTM state along a diagonal in the image. Figure 4 (right) illustrates the computation and the resulting receptive field.

The diagonal computation proceeds as follows. We first skew the input map into a space that makes it easy to apply convolutions along diagonals. The skewing operation offsets each row of the input map by one position with respect to the previous row, as illustrated in Figure 3; this results in a map of size $n \times (2n - 1)$. At this point we can compute the input-to-state and state-to-state components of the Diagonal BiLSTM. For each of the two directions, the input-to-state component is simply a 1 × 1 convolution $\mathbf{K}^{is}$ that contributes to the four gates in the LSTM core; the operation generates a $4h \times n \times n$ tensor. The state-to-state recurrent component is then computed with a column-wise convolution $\mathbf{K}^{ss}$ that has a kernel of size 2 × 1. The step takes the previous hidden and cell states, combines the contribution of the input-to-state component and produces the next hidden and cell states, as defined in Equation 3. The output feature map is then skewed back into an $n \times n$ map by removing the offset positions. This computation is repeated for each of the two directions. Given the two output maps, to prevent the layer from seeing future pixels, the right output map is then shifted down by one row and added to the left output map.

Besides reaching the full dependency field, the Diagonal BiLSTM has the additional advantage that it uses a convolutional kernel of size 2 × 1 that processes a minimal amount of information at each step yielding a highly non-linear computation. Kernel sizes larger than 2 × 1 are not particularly useful as they do not broaden the already global receptive field of the Diagonal BiLSTM.

### 3.3. Residual Connections

We train PixelRNNs of up to twelve layers of depth. As a means to both increase convergence speed and propagate signals more directly through the network, we deploy residual connections (He et al., 2015) from one LSTM layer to the next. Figure 5 shows a diagram of the residual blocks. The input map to the PixelRNN LSTM layer has $2h$ features. The input-to-state component reduces the number of features by producing $h$ features per gate. After applying the recurrent layer, the output map is upsampled back to $2h$ features per position via a 1 × 1 convolution and the input map is added to the output map. This method is related to previous approaches that use gating along the depth of the recurrent network (Kalchbrenner et al., 2015; Zhang et al., 2016), but has the advantage of not requiring additional gates. Apart from residual connections, one can also use learnable skip connections from each layer to the output. In the experiments we evaluate the relative effectiveness of residual and layer-to-output skip connections.

Figure 5. Residual blocks for a PixelCNN (left) and PixelRNNs.

### 3.4. Masked Convolution

The $h$ features for each input position at every layer in the network are split into three parts, each corresponding to one of the RGB channels. When predicting the R channel for the current pixel $x_i$, only the generated pixels left and above of $x_i$ can be used as context. When predicting the G channel, the value of the R channel can also be used as context in addition to the previously generated pixels. Likewise, for the B channel, the values of both the R and G channels can be used. To restrict connections in the network to these dependencies, we apply a mask to the input-to-state convolutions and to other purely convolutional layers in a PixelRNN.

We use two types of masks that we indicate with mask A and mask B, as shown in Figure 2 (Right). Mask A is applied only to the first convolutional layer in a PixelRNN and restricts the connections to those neighboring pixels and to those colors in the current pixels that have already been predicted. On the other hand, mask B is applied to all the subsequent input-to-state convolutional transitions and relaxes the restrictions of mask A by also allowing the connection from a color to itself. The masks can be easily implemented by zeroing out the corresponding weights in the input-to-state convolutions after each update. Similar masks have also been used in variational autoencoders (Gregor et al., 2014; Germain et al., 2015).

### 3.5. PixelCNN

The Row and Diagonal LSTM layers have a potentially unbounded dependency range within their receptive field. This comes with a computational cost as each state needs to be computed sequentially. One simple workaround is to make the receptive field large, but not unbounded. We can use standard convolutional layers to capture a bounded receptive field and compute features for all pixel positions at once. The PixelCNN uses multiple convolutional layers that preserve the spatial resolution; pooling layers are not used. Masks are adopted in the convolutions to avoid seeing the future context; masks have previously also been used in non-convolutional models such as MADE (Germain et al., 2015). Note that the advantage of parallelization of the PixelCNN over the PixelRNN is only available during training or during evaluating of test images. The image generation process is sequential for both kinds of networks, as each sampled pixel needs to be given as input back into the network.

### 3.6. Multi-Scale PixelRNN

The Multi-Scale PixelRNN is composed of an unconditional PixelRNN and one or more conditional PixelRNNs. The unconditional network first generates in the standard way a smaller $s \times s$ image that is subsampled from the original image. The conditional network then takes the $s \times s$ image as an additional input and generates a larger $n \times n$ image, as shown in Figure 2 (Middle).

The conditional network is similar to a standard PixelRNN, but each of its layers is biased with an upsampled version of the small $s \times s$ image. The upsampling and biasing processes are defined as follows. In the upsampling process, one uses a convolutional network with deconvolutional layers to construct an enlarged feature map of size $c \times n \times n$, where $c$ is the number of features in the output map of the upsampling network. Then, in the biasing process, for each layer in the conditional PixelRNN, one simply maps the $c \times n \times n$ conditioning map into a $4h \times n \times n$ map that is added to the input-to-state map of the corresponding layer; this is performed using a 1 × 1 unmasked convolution. The larger $n \times n$ image is then generated as usual.

## 4. Specifications of Models

In this section we give the specifications of the PixelRNNs used in the experiments. We have four types of networks: the PixelRNN based on Row LSTM, the one based on Diagonal BiLSTM, the fully convolutional one and the Multi-Scale one.

| PixelCNN | Row LSTM | Diagonal BiLSTM |
|---|---|---|
| 7 × 7 conv, mask A | 7 × 7 conv, mask A | 7 × 7 conv, mask A |
| Multiple residual blocks (see Figure 5): Conv 3 × 3, mask B | Multiple residual blocks (see Figure 5): Row LSTM, i-s: 3 × 1 mask B, s-s: 3 × 1 no mask | Multiple residual blocks (see Figure 5): Diagonal BiLSTM, i-s: 1 × 1 mask B, s-s: 1 × 2 no mask |
| ReLU followed by 1 × 1 conv, mask B (2 layers) | ReLU followed by 1 × 1 conv, mask B (2 layers) | ReLU followed by 1 × 1 conv, mask B (2 layers) |
| 256-way Softmax for each RGB color (natural images) or Sigmoid (MNIST) | 256-way Softmax for each RGB color (natural images) or Sigmoid (MNIST) | 256-way Softmax for each RGB color (natural images) or Sigmoid (MNIST) |

Table 1. Details of the architectures. In the LSTM architectures i-s and s-s stand for input-state and state-state convolutions.

Table 1 specifies each layer in the single-scale networks. The first layer is a 7 × 7 convolution that uses the mask of type A. The two types of LSTM networks then use a variable number of recurrent layers. The input-to-state convolution in this layer uses a mask of type B, whereas the state-to-state convolution is not masked. The PixelCNN uses convolutions of size 3 × 3 with a mask of type B. The top feature map is then passed through a couple of layers consisting of a Rectified Linear Unit (ReLU) and a 1 × 1 convolution. For the CIFAR-10 and ImageNet experiments, these layers have 1024 feature maps; for the MNIST experiment, the layers have 32 feature maps. Residual and layer-to-output connections are used across the layers of all three networks.

The networks used in the experiments have the following hyperparameters. For MNIST we use a Diagonal BiLSTM with 7 layers and a value of $h = 16$ (Section 3.3 and Figure 5 right). For CIFAR-10 the Row and Diagonal BiLSTMs have 12 layers and a number of $h = 128$ units. The PixelCNN has 15 layers and $h = 128$. For 32 × 32 ImageNet we adopt a 12 layer Row LSTM with $h = 384$ units and for 64 × 64 ImageNet we use a 4 layer Row LSTM with $h = 512$ units; the latter model does not use residual connections.

## 5. Experiments

In this section we describe our experiments and results. We begin by describing the way we evaluate and compare our results. In Section 5.2 we give details about the training. Then we give results on the relative effectiveness of architectural components and our best results on the MNIST, CIFAR-10 and ImageNet datasets.

### 5.1. Evaluation

All our models are trained and evaluated on the log-likelihood loss function coming from a discrete distribution. Although natural image data is usually modeled with continuous distributions using density functions, we can compare our results with previous art in the following way.

In the literature it is currently best practice to add real-valued noise to the pixel values to dequantize the data when using density functions (Uria et al., 2013). When uniform noise is added (with values in the interval [0, 1]), then the log-likelihoods of continuous and discrete models are directly comparable (Theis et al., 2015). In our case, we can use the values from the discrete distribution as a piecewise-uniform continuous function that has a constant value for every interval $[i, i + 1]$, $i = 1, 2, \ldots, 256$. This corresponding distribution will have the same log-likelihood (on data with added noise) as the original discrete distribution (on discrete data).

For MNIST we report the negative log-likelihood in nats as it is common practice in literature. For CIFAR-10 and ImageNet we report negative log-likelihoods in bits per dimension. The total discrete log-likelihood is normalized by the dimensionality of the images (e.g., 32 × 32 × 3 = 3072 for CIFAR-10). These numbers are interpretable as the number of bits that a compression scheme based on this model would need to compress every RGB color value (van den Oord & Schrauwen, 2014b; Theis et al., 2015); in practice there is also a small overhead due to arithmetic coding.

### 5.2. Training Details

Our models are trained on GPUs using the Torch toolbox. From the different parameter update rules tried, RMSProp gives best convergence performance and is used for all experiments. The learning rate schedules were manually set for every dataset to the highest values that allowed fast convergence. The batch sizes also vary for different datasets. For smaller datasets such as MNIST and CIFAR-10 we use smaller batch sizes of 16 images as this seems to regularize the models. For ImageNet we use as large a batch size as allowed by the GPU memory; this corresponds to 64 images/batch for 32 × 32 ImageNet, and 32 images/batch for 64 × 64 ImageNet. Apart from scaling and centering the images at the input of the network, we don't use any other preprocessing or augmentation. For the multinomial loss function we use the raw pixel color values as categories. For all the PixelRNN models, we learn the initial recurrent state of the network.

### 5.3. Discrete Softmax Distribution

Apart from being intuitive and easy to implement, we find that using a softmax on discrete pixel values instead of a mixture density approach on continuous pixel values gives better results. For the Row LSTM model with a softmax output distribution we obtain 3.06 bits/dim on the CIFAR-10 validation set. For the same model with a Mixture of Conditional Gaussian Scale Mixtures (MCGSM) (Theis & Bethge, 2015) we obtain 3.22 bits/dim.

In Figure 6 we show a few softmax activations from the model. Although we don't embed prior information about the meaning or relations of the 256 color categories, e.g. that pixel values 51 and 52 are neighbors, the distributions predicted by the model are meaningful and can be multimodal, skewed, peaked or long tailed. Also note that values 0 and 255 often get a much higher probability as they are more frequent. Another advantage of the discrete distribution is that we do not worry about parts of the distribution mass lying outside the interval [0, 255], which is something that typically happens with continuous distributions.

Figure 6. Example softmax activations from the model. The top left shows the distribution of the first pixel red value (first value to sample). (Four panels; horizontal axis: pixel value, 0 to 255.)

### 5.4. Residual Connections

Another core component of the networks is residual connections. In Table 2 we show the results of having residual connections, having standard skip connections or having both, in the 12-layer CIFAR-10 Row LSTM model. We see that using residual connections is as effective as using skip connections; using both is also effective and preserves the advantage.

| | No skip | Skip |
|---|---|---|
| No residual | 3.22 | 3.09 |
| Residual | 3.07 | 3.06 |

Table 2. Effect of residual and skip connections in the Row LSTM network evaluated on the CIFAR-10 validation set in bits/dim.

When using both the residual and skip connections, we see in Table 3 that performance of the Row LSTM improves with increased depth. This holds for up to the 12 LSTM layers that we tried.
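All of the models compared in these experiments rely on the mask A and mask B scheme of Section 3.4. A minimal NumPy sketch of how such masks could be built is given below. It is an illustration under stated assumptions rather than the authors' code: the function name `make_mask` is hypothetical, the kernel layout is taken to be `(c_out, c_in, k, k)`, and the feature channels are assumed to be stored as three contiguous groups in R, G, B order.

```python
import numpy as np

def make_mask(c_out, c_in, k, mask_type):
    """Build a 0/1 mask for a k x k convolution kernel.

    Assumes channels are laid out as three equal contiguous groups
    (R, G, B); this layout is an assumption made for the sketch.
    """
    assert mask_type in ("A", "B")
    assert k % 2 == 1 and c_out % 3 == 0 and c_in % 3 == 0
    mask = np.zeros((c_out, c_in, k, k))
    mid = k // 2
    mask[:, :, :mid, :] = 1.0    # all rows above the current pixel
    mask[:, :, mid, :mid] = 1.0  # pixels to the left in the current row

    # Color group of every channel: 0 = R, 1 = G, 2 = B.
    out_group = np.arange(c_out) // (c_out // 3)
    in_group = np.arange(c_in) // (c_in // 3)
    if mask_type == "A":
        # Only colors that were already predicted for the current pixel.
        allowed = in_group[None, :] < out_group[:, None]
    else:
        # Mask B also lets a color connect to itself.
        allowed = in_group[None, :] <= out_group[:, None]
    mask[:, :, mid, mid] = allowed.astype(float)
    return mask

# Center taps of a 3 x 3 kernel with one channel per color.
print(make_mask(3, 3, 3, "A")[:, :, 1, 1])
print(make_mask(3, 3, 3, "B")[:, :, 1, 1])
```

In use, the kernel weights would be multiplied element-wise by the mask after every parameter update, which is the zeroing-out procedure the paper describes.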

7. Pixel Recurrent Neural Networks Figure 7. Samples from models trained on CIFAR-10 (left) and ImageNet 32x32 (right) images. In general we can see that the models capture local spatial dependencies relatively well. The ImageNet model seems to be better at capturing more global structures than the CIFAR-10 model. The ImageNet model was larger and trained on much more data, which explains the qualitative difference in samples. # layers: 1 2 3 6 9 12 Model NLL Test NLL: 3.30 3.20 3.17 3.09 3.08 3.06 DBM 2hl [1]: ≈ 84.62 DBN 2hl [2]: ≈ 84.55 Table 3. Effect of the number of layers on the negative log likeli- NADE [3]: 88.33 hood evaluated on the CIFAR-10 validation set (bits/dim). EoNADE 2hl (128 orderings) [3]: 85.10 EoNADE-5 2hl (128 orderings) [4]: 84.68 DLGM [5]: ≈ 86.60 5.5. MNIST DLGM 8 leapfrog steps [6]: ≈ 85.51 DARN 1hl [7]: ≈ 84.13 Although the goal of our work was to model natural images MADE 2hl (32 masks) [8]: 86.64 on a large scale, we also tried our model on the binary ver- DRAW [9]: ≤ 80.97 sion (Salakhutdinov & Murray, 2008) of MNIST (LeCun PixelCNN: 81.30 et al., 1998) as it is a good sanity check and there is a lot Row LSTM: 80.54 of previous art on this dataset to compare with. In Table 4 Diagonal BiLSTM (1 layer, h = 32): 80.75 we report the performance of the Diagonal BiLSTM model Diagonal BiLSTM (7 layers, h = 16): 79.20 and that of previous published results. To our knowledge this is the best reported result on MNIST so far. Table 4. Test set performance of different models on MNIST in nats (negative log-likelihood). Prior results taken from [1] 5.6. CIFAR-10 (Salakhutdinov & Hinton, 2009), [2] (Murray & Salakhutdinov, 2009), [3] (Uria et al., 2014), [4] (Raiko et al., 2014), [5] (Rezende Next we test our models on the CIFAR-10 dataset et al., 2014), [6] (Salimans et al., 2015), [7] (Gregor et al., 2014), (Krizhevsky, 2009). Table 5 lists the results of our mod- [8] (Germain et al., 2015), [9] (Gregor et al., 2015). 
els and that of previously published approaches. All our results were obtained without data augmentation. For the proposed networks, the Diagonal BiLSTM has the best performance, followed by the Row LSTM and the PixelCNN. This coincides with the size of the respective receptive fields: the Diagonal BiLSTM has a global view, the Row LSTM has a partially occluded view and the PixelCNN sees the fewest pixels in the context. This suggests that effectively capturing a large receptive field is important. Figure 7 (left) shows CIFAR-10 samples generated from the Diagonal BiLSTM.

| Model | NLL Test (Train) |
|---|---|
| Uniform Distribution | 8.00 |
| Multivariate Gaussian | 4.70 |
| NICE [1] | 4.48 |
| Deep Diffusion [2] | 4.20 |
| Deep GMMs [3] | 4.00 |
| RIDE [4] | 3.47 |
| PixelCNN | 3.14 (3.08) |
| Row LSTM | 3.07 (3.00) |
| Diagonal BiLSTM | 3.00 (2.93) |

Table 5. Test set performance of different models on CIFAR-10 in bits/dim. For our models we give training performance in brackets. [1] (Dinh et al., 2014), [2] (Sohl-Dickstein et al., 2015), [3] (van den Oord & Schrauwen, 2014a), [4] personal communication (Theis & Bethge, 2015).

5.7. ImageNet

Although to our knowledge there are no published results on the ILSVRC ImageNet dataset (Russakovsky et al., 2015) that we can compare our models with, we give our ImageNet log-likelihood performance in Table 6 (without data augmentation). On ImageNet the current PixelRNNs do not appear to overfit, as we saw that their validation performance improved with size and depth. The main constraints on model size are currently computation time and GPU memory.

| Image size | NLL Validation (Train) |
|---|---|
| 32x32 | 3.86 (3.83) |
| 64x64 | 3.63 (3.57) |

Table 6. Negative log-likelihood performance on 32 × 32 and 64 × 64 ImageNet in bits/dim.

Note that the ImageNet images are in general less compressible than the CIFAR-10 images. ImageNet has a greater variety of images, and the CIFAR-10 images were most likely resized with a different algorithm than the one we used for ImageNet images. The ImageNet images are less blurry, which means neighboring pixels are less correlated with each other and thus less predictable. Because the downsampling method can influence the compression performance, we have made the downsampled images we used available.¹

¹ http://image-net.org/small/download.php

Figure 8. Samples from models trained on ImageNet 64x64 images. Left: normal model, right: multi-scale model. The single-scale model trained on 64x64 images is less able to capture global structure than the 32x32 model. The multi-scale model seems to resolve this problem. Although these models get similar performance in log-likelihood, the samples on the right do seem globally more coherent.

Figure 9. Image completions sampled from a model that was trained on 32x32 ImageNet images (panel labels: occluded, completions, original). Note that the diversity of the completions is high, which can be attributed to the log-likelihood loss function used in this generative model, as it encourages models with high entropy. As these are sampled from the model, we can easily generate millions of different completions. It is also interesting to see that textures such as water, wood and shrubbery are also inpainted relatively well (see Figure 1).

Figure 7 (right) shows 32 × 32 samples drawn from our model trained on ImageNet. Figure 8 shows 64 × 64 samples from the same model with and without multi-scale
conditioning. Finally, we also show image completions sampled from the model in Figure 9.

6. Conclusion

In this paper we significantly improve and build upon deep recurrent neural networks as generative models for natural images. We have described novel two-dimensional LSTM layers: the Row LSTM and the Diagonal BiLSTM, that scale more easily to larger datasets. The models were trained to model the raw RGB pixel values. We treated the pixel values as discrete random variables by using a softmax layer in the conditional distributions. We employed masked convolutions to allow PixelRNNs to model full dependencies between the color channels. We proposed and evaluated architectural improvements in these models resulting in PixelRNNs with up to 12 LSTM layers.

We have shown that the PixelRNNs significantly improve the state of the art on the MNIST and CIFAR-10 datasets. We also provide new benchmarks for generative image modeling on the ImageNet dataset. Based on the samples and completions drawn from the models we can conclude that the PixelRNNs are able to model both spatially local and long-range correlations and are able to produce images that are sharp and coherent. Given that these models improve as we make them larger and that there is practically unlimited data available to train on, more computation and larger models are likely to further improve the results.

Acknowledgements

The authors would like to thank Shakir Mohamed and Guillaume Desjardins for helpful input on this paper and Lucas Theis, Alex Graves, Karen Simonyan, Lasse Espeholt, Danilo Rezende, Karol Gregor and Ivo Danihelka for insightful discussions.

References

Bengio, Yoshua and Bengio, Samy. Modeling high-dimensional discrete data with multi-layer neural networks. pp. 400–406. MIT Press, 2000.

Dinh, Laurent, Krueger, David, and Bengio, Yoshua. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. arXiv preprint arXiv:1502.03509, 2015.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Graves, Alex and Schmidhuber, Jürgen. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, 2009.

Gregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell, Charles, and Wierstra, Daan. Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. Proceedings of the 32nd International Conference on Machine Learning, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 1997.

Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.

Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. 2009.

Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. The Journal of Machine Learning Research, 2011.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Murray, Iain and Salakhutdinov, Ruslan R. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems, 2009.

Neal, Radford M. Connectionist learning of belief networks. Artificial intelligence, 1992.

Raiko, Tapani, Li, Yao, Cho, Kyunghyun, and Bengio, Yoshua. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, 2014.

Rezende, Danilo J, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.

Salakhutdinov, Ruslan and Murray, Iain. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, 2008.

Salimans, Tim, Kingma, Diederik P, and Welling, Max. Markov chain monte carlo and variational inference: Bridging the gap. Proceedings of the 32nd International Conference on Machine Learning, 2015.

Sohl-Dickstein, Jascha, Weiss, Eric A., Maheswaranathan, Niru, and Ganguli, Surya. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 2015.

Stollenga, Marijn F, Byeon, Wonmin, Liwicki, Marcus, and Schmidhuber, Juergen. Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation. In Advances in Neural Information Processing Systems 28. 2015.

Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, 2011.

Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, 2015.

Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

Uria, Benigno, Murray, Iain, and Larochelle, Hugo. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.

Uria, Benigno, Murray, Iain, and Larochelle, Hugo. A deep and tractable density estimator. In Proceedings of the 31st International Conference on Machine Learning, 2014.

van den Oord, Aäron and Schrauwen, Benjamin. Factoring variations in natural images with deep gaussian mixture models. In Advances in Neural Information Processing Systems, 2014a.

van den Oord, Aäron and Schrauwen, Benjamin. The student-t mixture as a natural image patch prior with application to image compression. The Journal of Machine Learning Research, 2014b.

Zhang, Yu, Chen, Guoguo, Yu, Dong, Yao, Kaisheng, Khudanpur, Sanjeev, and Glass, James. Highway long short-term memory RNNs for distant speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2016.

Figure 10. Additional samples from a model trained on ImageNet 32x32 (right) images.
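
As a sanity check on the bits/dim scale used in Tables 5 and 6, the following minimal sketch converts a total negative log-likelihood in nats into bits per dimension. It assumes the score is normalized by the number of sub-pixel dimensions (32 × 32 × 3 = 3072 for a CIFAR-10 image); the function and variable names are illustrative and are not taken from the paper. It also shows how a bits/dim score can be read as an ideal average code length, which is the sense in which the text speaks of images being more or less compressible.

```python
import math


def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))


def ideal_code_length_bytes(bpd: float, num_dims: int) -> float:
    """Average code length in bytes implied by a bits/dim score (ideal entropy coder)."""
    return bpd * num_dims / 8


# A 32x32 RGB image has 32 * 32 * 3 = 3072 sub-pixel dimensions.
NUM_DIMS = 32 * 32 * 3

# Uniform baseline: each sub-pixel is one of 256 equally likely values,
# so the total NLL of any image is NUM_DIMS * ln(256) nats.
uniform_nll_nats = NUM_DIMS * math.log(256)
uniform_bpd = bits_per_dim(uniform_nll_nats, NUM_DIMS)

# Code lengths implied by two rows of Table 5 (values copied from the table).
raw_bytes = ideal_code_length_bytes(8.00, NUM_DIMS)
diag_bilstm_bytes = ideal_code_length_bytes(3.00, NUM_DIMS)

print(f"uniform baseline: {uniform_bpd:.2f} bits/dim")
print(f"uniform baseline: {raw_bytes:.0f} bytes per image")
print(f"Diagonal BiLSTM:  {diag_bilstm_bytes:.0f} bytes per image")
```

Under these assumptions the uniform baseline comes out at 8.00 bits/dim, matching the first row of Table 5, and the 3.00 bits/dim of the Diagonal BiLSTM corresponds to an ideal code length of 1152 bytes per image against 3072 bytes for the raw pixels.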