Image-to-Image Translation with Conditional Adversarial Networks

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory, University of California, Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
arXiv:1611.07004v1 [cs.CV] 21 Nov 2016

[Figure 1 panels, each shown as an input/output pair: Labels to Street Scene, Labels to Facade, BW to Color, Aerial to Map, Day to Night, Edges to Photo.]

Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Many problems in image processing, computer graphics, and computer vision can be posed as “translating” an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the problem of translating one possible representation of a scene into another, given sufficient training data (see Figure 1). One reason language translation is difficult is that the mapping between languages is rarely one-to-one: any given concept is easier to express in one language than another. Similarly, most image-to-image translation problems are either many-to-one (computer vision), mapping photographs to edges, segments, or semantic labels, or one-to-many (computer graphics), mapping labels or sparse user inputs to realistic images. Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems.

The community has already taken significant steps in this direction, with convolutional neural nets (CNNs) becoming the common workhorse behind a wide variety of image prediction problems. CNNs learn to minimize a loss function (an objective that scores the quality of results), and although the learning process is automatic, a lot of manual effort still goes into designing effective losses. In other words, we still have to tell the CNN what we wish it to minimize. But, just like Midas, we must be careful what we wish for! If we take a naive approach, and ask the CNN to minimize Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [29, 46]. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Coming up with loss functions that force the CNN to do what we really want (e.g., output sharp, realistic images) is an open problem and generally requires expert knowledge.

It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguishable from reality”, and then automatically learn a loss function appropriate for satisfying this goal. Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [14, 5, 30, 36, 47]. GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

In this paper, we explore GANs in the conditional setting. Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model [14]. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

GANs have been vigorously studied in the last two years and many of the techniques we explore in this paper have been previously proposed. Nonetheless, earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-to-image translation. Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results. Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. Code is available at https://github.com/phillipi/pix2pix.

1. Related work

Structured losses for image modeling. Image-to-image translation problems are often formulated as per-pixel classification or regression [26, 42, 17, 23, 46]. These formulations treat the output space as “unstructured” in the sense that each output pixel is considered conditionally independent from all others given the input image. Conditional GANs instead learn a structured loss. Structured losses penalize the joint configuration of the output. A large body of literature has considered losses of this kind, with popular methods including conditional random fields [2], the SSIM metric [40], feature matching [6], nonparametric losses [24], the convolutional pseudo-prior [41], and losses based on matching covariance statistics [19]. Our conditional GAN is different in that the loss is learned, and can, in theory, penalize any possible structure that differs between output and target.

Conditional GANs. We are not the first to apply GANs in the conditional setting. Previous works have conditioned GANs on discrete labels [28], text [32], and, indeed, images. The image-conditional models have tackled inpainting [29], image prediction from a normal map [39], image manipulation guided by user constraints [49], future frame prediction [27], future state prediction [48], product photo generation [43], and style transfer [25]. Each of these methods was tailored for a specific application. Our framework differs in that nothing is application-specific. This makes our setup considerably simpler than most others.

Our method also differs from these prior works in several architectural choices for the generator and discriminator. Unlike past work, for our generator we use a “U-Net”-based architecture [34], and for our discriminator we use a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed in [25], for the purpose of capturing local style statistics. Here we show that this approach is effective on a wider range of problems, and we investigate the effect of changing the patch size.

2. Method

GANs are generative models that learn a mapping from random noise vector $z$ to output image $y$: $G : z \to y$ [14]. In contrast, conditional GANs learn a mapping from observed image $x$ and random noise vector $z$, to $y$: $G : \{x, z\} \to y$. The generator $G$ is trained to produce outputs that cannot be distinguished from “real” images by an adversarially trained discriminator, $D$, which is trained to do as well as possible at detecting the generator’s “fakes”. This training procedure is diagrammed in Figure 2.

2.1. Objective

The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y \sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))], \tag{1}$$
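As a rough numerical reading of Eq. 1, the two expectations can be estimated by averaging over a batch. The sketch below is only an illustration: the function name `cgan_objective` and the probability values are hypothetical, and the discriminator outputs are supplied as plain numbers rather than produced by a network.

```python
import math


def cgan_objective(d_real, d_fake):
    """Batch estimate of the cGAN objective in Eq. 1.

    d_real: discriminator outputs D(x, y) on real pairs, each in (0, 1).
    d_fake: discriminator outputs D(x, G(x, z)) on synthesized pairs.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical values: a discriminator that separates real from fake well...
confident_d = cgan_objective(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ...versus one that the generator has fooled completely (outputs 0.5 everywhere).
fooled_d = cgan_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident_d, fooled_d)
```

The objective is higher when the discriminator separates the two kinds of pairs and lower when it is fooled, which matches the roles described in the text: the discriminator pushes the value up and the generator pushes it down.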

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.

[Figure 2 diagram: positive examples and negative examples are each passed to D, which answers “Real or fake pair?”. G tries to synthesize fake images that fool D; D tries to identify the fakes.]

Figure 2: Training a conditional GAN to predict aerial photos from maps. The discriminator, D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe an input image.

[Figure 3 diagram: Encoder-decoder and U-Net.]

Figure 3: Two choices for the architecture of the generator. The “U-Net” [34] is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.

To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe $x$:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y \sim p_{data}(y)}[\log D(y)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(G(x, z)))]. \tag{2}$$

Previous approaches to conditional GANs have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [29]. The discriminator’s job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y \sim p_{data}(x,y),\, z \sim p_z(z)}[\lVert y - G(x, z) \rVert_1]. \tag{3}$$

Our final objective is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G). \tag{4}$$

Without $z$, the net could still learn a mapping from $x$ to $y$, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise $z$ as an input to the generator, in addition to $x$ (e.g., [39]). In initial experiments, we did not find this strategy effective (the generator simply learned to ignore the noise), which is consistent with Mathieu et al. [27]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe very minor stochasticity in the output of our nets. Designing conditional GANs that produce stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.

2.2. Network architectures

We adapt our generator and discriminator architectures from those in [30]. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU [18]. Details of the architecture are provided in the appendix, with key features discussed below.

2.2.1 Generator with skips

A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems we consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. We design the generator architecture around these considerations.

Many previous solutions [29, 39, 19, 48, 43] to problems in this area have used an encoder-decoder network [16]. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed (Figure 3). Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be
desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges.

To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [34] (Figure 3). Specifically, we add skip connections between each layer $i$ and layer $n - i$, where $n$ is the total number of layers. Each skip connection simply concatenates all channels at layer $i$ with those at layer $n - i$.

2.2.2 Markovian discriminator (PatchGAN)

It is well known that the L2 loss (and L1, see Figure 4) produces blurry results on image generation problems [22]. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.

This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture, which we term a PatchGAN, that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.

In Section 3.4, we demonstrate that N can be much smaller than the full size of the image and still produce high quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied on arbitrarily large images.

Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [25], and is also the common assumption in models of texture [8, 12] and style [7, 15, 13, 24]. Our PatchGAN can therefore be understood as a form of texture/style loss.

2.3. Optimization and inference

To optimize our networks, we follow the standard approach from [14]: we alternate between one gradient descent step on D, then one step on G. We use minibatch SGD and apply the Adam solver [20].

At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [18] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks [38]. In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.

3. Experiments

To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation:

• Semantic labels↔photo, trained on the Cityscapes dataset [4].
• Architectural labels→photo, trained on the CMP Facades dataset [31].
• Map↔aerial photo, trained on data scraped from Google Maps.
• BW→color photos, trained on [35].
• Edges→photo, trained on data from [49] and [44]; binary edges generated using the HED edge detector [42] plus postprocessing.
• Sketch→photo: tests edges→photo models on human-drawn sketches from [10].
• Day→night, trained on [21].

Details of training on each of these datasets are provided in the Appendix. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 10, 11, 12, 13, 14, 15, and 16. Several failure cases are highlighted in Figure 17. More comprehensive results are available at https://phillipi.github.io/pix2pix/.

Data requirements and speed. We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU.

3.1. Evaluation metrics

Evaluating the quality of synthesized images is an open and difficult problem [36]. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.

In order to more holistically evaluate the visual quality of our results, we employ two tactics. First, we run
“real vs fake” perceptual studies on Amazon Mechanical Turk (AMT). For graphics problems like colorization and photo generation, plausibility to a human observer is often the ultimate goal. Therefore, we test our map generation, aerial photo generation, and image colorization using this approach.

[Figure 5 panels: rows Encoder-decoder and U-Net; columns L1 and L1+cGAN.]

Figure 5: Adding skip connections to an encoder-decoder to create a “U-Net” results in much higher quality results.

Loss           Per-pixel acc.   Per-class acc.   Class IOU
L1             0.44             0.14             0.10
GAN            0.22             0.05             0.01
cGAN           0.61             0.21             0.16
L1+GAN         0.64             0.19             0.15
L1+cGAN        0.63             0.21             0.16
Ground truth   0.80             0.26             0.21

Table 1: FCN-scores for different losses, evaluated on Cityscapes labels↔photos.

Discriminator receptive field   Per-pixel acc.   Per-class acc.   Class IOU
1×1                             0.44             0.14             0.10
16×16                           0.62             0.20             0.16
70×70                           0.63             0.21             0.16
256×256                         0.47             0.18             0.13

Table 2: FCN-scores for different receptive field sizes of the discriminator, evaluated on Cityscapes labels→photos.

Second, we measure whether or not our synthesized cityscapes are realistic enough that an off-the-shelf recognition system can recognize the objects in them. This metric is similar to the “inception score” from [36], the object detection evaluation in [39], and the “semantic interpretability” measure in [46].

AMT perceptual studies. For our AMT experiments, we followed the protocol from [46]: Turkers were presented with a series of trials that pitted a “real” image against a “fake” image generated by our algorithm. On each trial, each image appeared for 1 second, after which the images disappeared and Turkers were given unlimited time to respond as to which was fake. The first 10 images of each session were practice and Turkers were given feedback. No feedback was provided on the 40 trials of the main experiment. Each session tested just one algorithm at a time, and Turkers were not allowed to complete more than one session. About 50 Turkers evaluated each algorithm. All images were presented at 256 × 256 resolution. Unlike [46], we did not include vigilance trials. For our colorization experiments, the real and fake images were generated from the same grayscale input. For map↔aerial photo, the real and fake images were not generated from the same input, in order to make the task more difficult and avoid floor-level results.

FCN-score. While quantitative evaluation of generative models is known to be challenging, recent works [36, 39, 46] have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well. To this end, we adopt the popular FCN-8s [26] architecture for semantic segmentation, and train it on the Cityscapes dataset. We then score synthesized photos by the classification accuracy against the labels these photos were synthesized from.

3.2. Analysis of the objective function

Which components of the objective in Eqn. 4 are important? We run ablation studies to isolate the effect of the L1 term and the GAN term, and to compare using a discriminator conditioned on the input (cGAN, Eqn. 1) against using an unconditional discriminator (GAN, Eqn. 2).

Figure 4 shows the qualitative effects of these variations on two labels→photo problems. L1 alone leads to reasonable but blurry results. The cGAN alone (setting λ = 0 in Eqn. 4) gives much sharper results, but introduces some artifacts in facade synthesis. Adding both terms together (with λ = 100) reduces these artifacts.

We quantify these observations using the FCN-score on the Cityscapes labels→photo task (Table 1): the GAN-based objectives achieve higher scores, indicating that the synthesized images include more recognizable structure. We also test the effect of removing conditioning from the discriminator (labeled as GAN). In this case, the loss does not penalize mismatch between the input and output; it only cares that the output look realistic. This variant results in very poor performance; examining the results reveals that the generator collapsed into producing nearly the exact same output regardless of input photograph. Clearly it is important, in this case, that the loss measure the quality of the match between input and output, and indeed cGAN performs much better than GAN. Note, however, that adding an L1 term also encourages that the output respect the input, since the L1 loss penalizes the distance between ground truth outputs, which match the input, and synthesized outputs, which may not. Correspondingly, L1+GAN is also effective at creating realistic renderings that respect the input label maps. Combining all terms, L1+cGAN, performs similarly well.

[Figure 4 columns: Input, Ground truth, L1, cGAN, L1 + cGAN.]

Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see https://phillipi.github.io/pix2pix/ for additional examples.

[Figure 6 columns: L1, 1x1, 16x16, 70x70, 256x256.]

Figure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1x1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16x16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70x70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 256x256 ImageGAN produces results that are visually similar to the 70x70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 2). Please see https://phillipi.github.io/pix2pix/ for additional examples.

Colorfulness. A striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map. One might imagine cGANs have a similar effect on “sharpening” in the spectral dimension, i.e. making images more colorful. Just as L1 will incentivize a blur when it is uncertain where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of several plausible color values a pixel should take on. Specifically, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. An adversarial loss, on the other hand, can in principle become aware that grayish outputs are unrealistic, and encourage matching the true color distribution [14]. In Figure 7, we investigate if our cGANs actually achieve this effect on the Cityscapes dataset. The plots show the marginal distributions over output color values in Lab color space. The ground truth distributions are shown with a dotted line. It is apparent that L1 leads to a narrower distribution than the ground truth, confirming the hypothesis that L1 encourages average, grayish colors. Using a cGAN, on the other hand, pushes the output distribution closer to the ground truth.

3.3. Analysis of the generator architecture

A U-Net architecture allows low-level information to shortcut across the network. Does this lead to better results? Figure 5 compares the U-Net against an encoder-decoder on cityscape generation. The encoder-decoder is created simply by severing the skip connections in the U-Net. The encoder-decoder is unable to learn to generate realistic images in our experiments, and indeed collapses to producing nearly identical results for each input label map. The advantages of the U-Net appear not to be specific to conditional GANs: when both U-Net and encoder-decoder are trained
with an L1 loss, the U-Net again achieves the superior results (Figure 5).

[Figure 7 panels: (a) log P(L), (b) log P(a), (c) log P(b), each plotted for L1, cGAN, L1+cGAN, L1+pixelcGAN, and ground truth; (d) histogram intersection against ground truth, tabulated below.]

Loss       L      a      b
L1         0.81   0.69   0.70
cGAN       0.87   0.74   0.84
L1+cGAN    0.86   0.84   0.82
PixelGAN   0.83   0.68   0.78

Figure 7: Color distribution matching property of the cGAN, tested on Cityscapes (c.f. Figure 1 of the original GAN paper [14]). Note that the histogram intersection scores are dominated by differences in the high probability region, which are imperceptible in the plots, which show log probability and therefore emphasize differences in the low probability regions.

[Figure 8 columns: Input, Ground truth, L1, cGAN.]

Figure 8: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at a glance like the ground truth, but in fact include many small, hallucinated objects.

[Figure 9 columns: L2 [46], Classification (rebal.) [46], Ours (L1 + cGAN), Ground truth.]

Figure 9: Colorization results of conditional GANs versus the L2 regression from [46] and the full method (classification with rebalancing) from [46]. The cGANs can produce compelling colorizations (first two rows), but have a common failure mode of producing a grayscale or desaturated result (last row).

[...] nearly discrete, rather than “images”, with their continuous-valued variation. Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 4. We argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient.

3.4. From PixelGANs to PatchGANs to ImageGANs

We test the effect of varying the patch size N of our discriminator receptive fields, from a 1 × 1 “PixelGAN” to a full 256 × 256 “ImageGAN”. Figure 6 shows qualitative results of this analysis and Table 2 quantifies the effects using the FCN-score. Note that elsewhere in this paper, unless specified, all experiments use 70 × 70 PatchGANs, and for this section all experiments use an L1+cGAN loss.

The PixelGAN has no effect on spatial sharpness, but does increase the colorfulness of the results (quantified in Figure 7). For example, the bus in Figure 6 is painted gray when the net is trained with an L1 loss, but becomes red with the PixelGAN loss. Color histogram matching is a common problem in image processing [33], and PixelGANs may be a promising lightweight solution.

Using a 16 × 16 PatchGAN is sufficient to promote sharp outputs, but also leads to tiling artifacts. The 70 × 70 PatchGAN alleviates these artifacts. Scaling beyond this, to the full 256 × 256 ImageGAN, does not appear to improve the visual quality of the results, and in fact gets a considerably lower FCN-score (Table 2). This may be because the ImageGAN has many more parameters and greater depth than the 70 × 70 PatchGAN, and may be harder to train.
745 744 691 orizations balancing)(first fromtwo [46].rows), Thebut cGANs have acan common producefailure mode of compelling col- 745 691 692 orizations 746 745 692 orizationsa(first Fully-convolutional producing grayscale (first two two rows), but ortranslationbut have desaturated rows), have Anaa common result advantage (last common failure row).failureofmodethe of mode of Photo → Map Map → Photo 746 692 693 producing 747 746 PatchGAN producing is athat a grayscale a fixed-size grayscale or or desaturated patch result desaturated (last (lastrow). discriminator result row). can be Loss % Turkers labeled real % Turkers labeled real 693 693 694 applied to arbitrarily large images. We may also apply the 4. Conclusion L1 2.8% ± 1.0% 0.8% ± 0.3% 748747 747 694 694 695 To begin to test this, we train a cGAN (with/without L1 4. Conclusion 4.L1+cGAN Conclusion6.1% ± 1.3% 18.9% ± 2.5% 749748 748 generator convolutionally, on larger images than those on L1 695 696 695 loss)To Toonbegin to cityscape begin to test this, this, we testphoto!labels. we traintrain aaFigure cGAN cGAN8(with/without shows qualita-L1 (with/without Table 3: AMTin“real The results this vs fake”suggest paper test on maps↔aerial that conditional photos. adver- 750749 749 which it was loss)results, trained. on cityscape We test photo!labels.this on the map↔aerial 8 showsphoto Figureaccuracies qualita- 696 697 696 tive and quantitative classification loss) on cityscape photo!labels. 
Figure 8 shows qualita- are re- sarialThe resultsare networks in this paper suggest a promising approach thatforconditional many image- The results in this paper suggest that conditional adver- adver- 751750 750 task.tive After training a generator onclassification 256×256 images, we test 697 698 697 tive results, portedresults, and and4.quantitative in Table Interestingly, quantitative cGANs, trained classification accuracies without accuracies are are re- there- sarial sarial networks to-image translation networks Method are aa promising aretasks, especially promising approach those approach % Turkers for realmany involving labeled for many image- highly image- 752751 751 it on 512 × ported 512 in images. Table 4. The results Interestingly, in Figure cGANs, 8 demonstrate trained without the to-image 698 699 698 L1 loss, are able to solve this problem at a reasonable degree ported in Table 4. Interestingly, cGANs, trained without the to-image translation tasks, especially those involvinghighly structured L2translation graphical regression tasks, outputs. from [46] especially These those networks 16.3% ± 2.4% involving learn a loss highly 753752 752 the effectiveness L1 loss, of this approach. loss, are areable to tosolve this thisproblem this isat aareasonable degree structured graphical et al. 2016outputs. These networks 699 700 699 ofL1accuracy. To able our knowledge, solve problem the at first demonstra- reasonable degree adapted toZhang structured the task and data graphical [46] 27.8% ± at hand, outputs. which These makeslearn 2.7% networks them learn aap- a loss loss 754753 753 Ours 22.5% ± 1.6% 700 701 700 of tion accuracy. of GANs To our knowledge, successfully this generating is the first “labels”, of accuracy. To our knowledge, this is the first demonstra- demonstra- which are adapted adapted to the task and data at hand, which makes themap- plicable into a the wide task and variety data of at hand, settings. 
which makes them ap- 755754 754 tion of GANs successfully generating “labels”, which are Table 4:a AMT plicable “real vs of fake” test on colorization. 701 701 1 Wetion achieve this variation in patch size by adjusting the depth of the are of GANs successfully generating “labels”, which plicablein in awide widevariety variety ofsettings. settings. 755 755 GAN discriminator. Details of this process, and the discriminator architec- tures are provided in the appendix 7 77
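The link between discriminator depth and patch size N mentioned in footnote 1 can be checked with a little receptive-field arithmetic. The sketch below is illustrative only: the function name and the stride assignments are assumptions, not taken from the text. In particular it assumes 4 × 4 filters with stride 2 in the early layers and stride 1 in the last two convolutions, since a five-layer stack with stride 2 throughout would see 94 pixels rather than 70.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a conv stack.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride             # strides compound the spacing for later layers
    return rf


# Hypothetical stride assignments (an assumption, not stated in the text):
# stride 2 for the early layers, stride 1 for the last two convolutions.
PIXEL_GAN = [(1, 1)] * 3                 # 1x1 filters throughout
PATCH_16 = [(4, 2), (4, 1), (4, 1)]      # shallow stack
PATCH_70 = [(4, 2)] * 3 + [(4, 1)] * 2   # deeper stack

for name, stack in [("1x1", PIXEL_GAN), ("16x16", PATCH_16), ("70x70", PATCH_70)]:
    print(name, receptive_field(stack))
```

Under these assumptions the three stacks see 1, 16 and 70 input pixels respectively, which is one way to read "adjusting the depth" as a control on patch size.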

3.5. Perceptual validation

We validate the perceptual realism of our results on the tasks of map↔aerial photograph and grayscale→color. Results of our AMT experiment for map↔photo are given in Table 3. The aerial photos generated by our method fooled participants on 18.9% of trials, significantly above the L1 baseline, which produces blurry results and almost never fooled participants. In contrast, in the photo→map direction, our method only fooled participants on 6.1% of trials, and this was not significantly different from the performance of the L1 baseline (based on a bootstrap test). This may be because minor structural errors are more visible in maps, which have rigid geometry, than in aerial photographs, which are more chaotic.

We trained colorization on ImageNet [35], and tested on the test split introduced by [46, 23]. Our method, with L1+cGAN loss, fooled participants on 22.5% of trials (Table 4). We also tested the results of [46] and a variant of their method that used an L2 loss (see [46] for details). The conditional GAN scored similarly to the L2 variant of [46] (difference insignificant by bootstrap test), but fell short of [46]’s full method, which fooled participants on 27.8% of trials in our experiment. We note that their method was specifically engineered to do well on colorization.

3.6. Semantic segmentation

Conditional GANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks. What about vision problems, like semantic segmentation, where the output is instead less complex than the input?

To begin to test this, we train a cGAN (with/without L1 loss) on cityscape photo→labels. Figure 10 shows qualitative results, and quantitative classification accuracies are reported in Table 5. Interestingly, cGANs, trained without the L1 loss, are able to solve this problem at a reasonable degree of accuracy. To our knowledge, this is the first demonstration of GANs successfully generating “labels”, which are nearly discrete, rather than “images”, with their continuous-valued variation². Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 5. We argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than for graphics tasks, and reconstruction losses like L1 are mostly sufficient.

Loss        Per-pixel acc.    Per-class acc.    Class IOU
L1          0.86              0.42              0.35
cGAN        0.74              0.28              0.22
L1+cGAN     0.83              0.36              0.29
Table 5: Performance of photo→labels on cityscapes.

Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at a glance like the ground truth, but in fact include many small, hallucinated objects. [Columns: Input, Ground truth, L1, cGAN.]

4. Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

Acknowledgments: We thank Richard Zhang and Deepak Pathak for helpful discussions. This work was supported in part by NSF SMA-1514512, NGA NURI, IARPA via Air Force Research Laboratory, Intel Corp, and hardware donations by nVIDIA. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government.

² Note that the label maps we train on are not exactly discrete valued, as they are resized from the original maps using bilinear interpolation and saved as jpeg images, with some compression artifacts.
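The three scores reported in Table 5 are not defined in this section. A minimal sketch follows, assuming the standard segmentation definitions computed from a class confusion matrix; the function name, the skipping of classes absent from the ground truth, and the toy matrix are all illustrative assumptions rather than the evaluation code used for the paper.

```python
def segmentation_scores(confusion):
    """Return (per-pixel acc., per-class acc., class IOU).

    confusion[t][p] = number of pixels of true class t predicted as class p.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    class_acc, class_iou = [], []
    for c in range(n):
        tp = confusion[c][c]
        truth = sum(confusion[c])                      # pixels labeled c
        pred = sum(confusion[t][c] for t in range(n))  # pixels predicted as c
        if truth == 0:
            continue  # assumption: classes absent from ground truth are skipped
        class_acc.append(tp / truth)
        class_iou.append(tp / (truth + pred - tp))     # intersection over union
    return (correct / total,
            sum(class_acc) / len(class_acc),
            sum(class_iou) / len(class_iou))


# Toy two-class confusion matrix (made-up numbers, for illustration only).
pixel_acc, mean_class_acc, mean_iou = segmentation_scores([[8, 2], [1, 9]])
print(pixel_acc, mean_class_acc, mean_iou)
```

Per-pixel accuracy is dominated by frequent classes, whereas the two class-averaged scores weight every class equally. This may be part of why the per-class and IOU columns of Table 5 sit well below the per-pixel column.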

Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256x256 resolution, and run convolutionally on the larger images at test time). Contrast adjusted for clarity. [Panels: Aerial photo to map, Map to aerial photo; columns: input, output.]

Figure 11: Example results of our method on Cityscapes labels→photo, compared to ground truth. [Columns: Input, Ground truth, Output.]

Figure 12: Example results of our method on facades labels→photo, compared to ground truth. [Columns: Input, Ground truth, Output.]

Figure 13: Example results of our method on day→night, compared to ground truth. [Columns: Input, Ground truth, Output.]

Figure 14: Example results of our method on automatically detected edges→handbags, compared to ground truth. [Columns: Input, Ground truth, Output.]

Figure 15: Example results of our method on automatically detected edges→shoes, compared to ground truth. [Columns: Input, Ground truth, Output.]

Figure 16: Example results of the edges→photo models applied to human-drawn sketches from [10]. Note that the models were trained on automatically detected edges, but generalize to human drawings. [Columns: Input, Output.]

Figure 17: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results. [Panels: Day → Night, Labels → Facade, Labels → Street scene, Edges → Shoe, Edges → Handbag, Sketch → Shoe, Sketch → Handbag.]

References

[1] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, volume 2, pages 60–65. IEEE, 2005.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[3] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[5] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
[6] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
[7] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, pages 341–346. ACM, 2001.
[8] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–1038. IEEE, 1999.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[10] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? SIGGRAPH, 31(4):44–1, 2012.
[11] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006.
[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. arXiv preprint arXiv:1505.07376, 12, 2015.
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[15] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327–340. ACM, 2001.
[16] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[17] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (TOG), 35(4), 2016.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
[19] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. 2016.
[20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[21] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
[22] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[23] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016.
[24] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. CVPR, 2016.
[25] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[27] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.
[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[29] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[31] R. Š. Radim Tyleček. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013.
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[33] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34–41, 2001.
[34] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
[37] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[38] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[39] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. ECCV, 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[41] S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. 2015.
[42] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[43] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. ECCV, 2016.
[44] A. Yu and K. Grauman. Fine-Grained Visual Comparisons with Local Learning. In CVPR, 2014.
[45] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, pages 192–199, 2014.
[46] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[47] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[48] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.
[49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.

5. Appendix

5.1. Network architectures

We adapt our network architectures from those in [30]. Code for the models is available at https://github.com/phillipi/pix2pix.

Let Ck denote a Convolution-BatchNorm-ReLU layer with k filters. CDk denotes a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4 × 4 spatial filters applied with stride 2. Convolutions in the encoder, and in the discriminator, downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.

5.1.1 Generator architectures

The encoder-decoder architecture consists of:
encoder:
C64-C128-C256-C512-C512-C512-C512-C512
decoder:
CD512-CD512-CD512-C512-C512-C256-C128-C64

After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.

The U-Net architecture is identical except with skip connections between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections concatenate activations from layer i to layer n − i. This changes the number of channels in the decoder:
U-Net decoder:
CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128

5.1.2 Discriminator architectures

The 70 × 70 discriminator architecture is:
C64-C128-C256-C512

After the last layer, a convolution is applied to map to a 1 dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2.

All other discriminators follow the same basic architecture, with depth varied to modify the receptive field size:
1 × 1 discriminator: C64-C128 (note, in this special case, all convolutions are 1 × 1 spatial filters)
16 × 16 discriminator: C64-C128
256 × 256 discriminator: C64-C128-C256-C512-C512-C512

Note that the 256 × 256 discriminator has receptive fields that could cover up to 574 × 574 pixels, if they were available, but since the input images are only 256 × 256 pixels, only 256 × 256 pixels are seen, and so we refer to this setting as the 256 × 256 discriminator.

5.2. Training details

Random jitter was applied by resizing the 256 × 256 input images to 286 × 286 and then randomly cropping back to size 256 × 256.

All networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.

Semantic labels→photo 2975 training images from the Cityscapes training set [4], trained for 200 epochs, batch size 1, with random jitter and mirroring. We used the Cityscapes val set for testing.

Architectural labels→photo 400 training images from [31], trained for 200 epochs, batch size 1, with random jitter and mirroring. Data was split into train and test randomly.

Maps↔aerial photograph 1096 training images scraped from Google Maps, trained for 200 epochs, batch size 1, with random jitter and mirroring. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

BW→color 1.2 million training images (Imagenet training set [35]), trained for ∼6 epochs, batch size 4, with only mirroring, no random jitter. Tested on a subset of the Imagenet val set, following the protocol of [46] and [23].

Edges→shoes 50k training images from the UT Zappos50K dataset [45], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Edges→Handbag 137K Amazon Handbag images from [49], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Day→night 17823 training images extracted from 91 webcams, from [21], trained for 17 epochs, batch size 4, with random jitter and mirroring. We use 91 webcams as training, and 10 webcams for test.
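The random jitter and mirroring described in Section 5.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released training code: images are plain nested lists, the helper names are hypothetical, and nearest-neighbour resizing and a 50% mirroring probability are assumed because the text specifies neither.

```python
import random


def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D grid (list of rows) to size x size.

    The interpolation method is an assumption; the text does not specify one.
    """
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def random_jitter(img, load_size=286, crop_size=256, rng=random):
    """Resize to load_size, take a random crop_size window, mirror half the time."""
    big = resize_nearest(img, load_size)
    top = rng.randint(0, load_size - crop_size)   # randint is inclusive at both ends
    left = rng.randint(0, load_size - crop_size)
    crop = [row[left:left + crop_size] for row in big[top:top + crop_size]]
    if rng.random() < 0.5:                        # assumed 50% mirroring probability
        crop = [row[::-1] for row in crop]        # horizontal mirror
    return crop


# Usage on a synthetic 256 x 256 single-channel image.
image = [[r * 256 + c for c in range(256)] for r in range(256)]
jittered = random_jitter(image, rng=random.Random(0))
print(len(jittered), len(jittered[0]))
```

For a paired training example, the same crop offsets and mirroring decision would presumably need to be applied to both the input and the target image so that they stay pixel-aligned.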