
@brannondorsey
Last active January 3, 2022 09:57
Notes on the Pix2Pix (pixel-level image-to-image translation) arXiv paper

Image-to-Image Translation with Conditional Adversarial Networks

Notes from arXiv:1611.07004v1 [cs.CV] 21 Nov 2016

  • Euclidean distance between predicted and ground truth pixels is not a good method of judging similarity, because minimizing it averages over all plausible outputs and therefore yields blurry images.
  • GANs learn a loss function rather than using an existing one.
  • GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss.
  • Conditional GANs (cGANs) learn a mapping from observed image x and random noise vector z to y: y = f(x, z)
  • The generator G is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator D, which is trained to do as well as possible at detecting the generator's "fakes".
  • The discriminator D learns to classify between real and synthesized pairs. The generator learns to fool the discriminator.
  • Unlike an unconditional GAN, both the generator and discriminator observe the input image x.
  • The objective asks G not only to fool the discriminator but also to be near the ground truth output in an L2 sense (the combined objective is written out after this list).
  • In practice, L1 distance between G's output and the ground truth is used instead of L2 because it encourages less blurring.
  • Without z, the net could still learn a mapping from x to y, but it would produce deterministic outputs and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x.
  • Either a vanilla encoder-decoder or a U-Net can be selected as the architecture for G in this implementation.
  • Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU.
  • A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid.
  • Input and output images differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
  • L1 loss does very well at low frequencies (I think this means general tonal distribution/contrast, color blotches, etc.) but fails at high frequencies (crispness/edges/detail), thus you get blurry images. This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness. In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture, which we term a PatchGAN, that only penalizes structure at the scale of patches. This discriminator tries to classify if each NxN patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D (a sketch of such a patch discriminator appears after this list).
  • Because the PatchGAN assumes independence between pixels separated by more than a patch diameter (N), it can be thought of as a form of texture/style loss.
  • To optimize our networks, we alternate between one gradient descent step on D, then one step on G (using minibatch SGD with the Adam solver); see the training-step sketch after this list.
  • In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.
  • To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation.
  • Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.
  • FCN-Score: while quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well.
  • cGANs seem to work much better than GANs for this type of image-to-image transformation; with a plain GAN, the generator collapses into producing nearly the exact same output regardless of the input photograph.
  • A 16x16 PatchGAN produces sharp outputs but causes tiling artifacts; a 70x70 PatchGAN alleviates these artifacts. The full 286x286 ImageGAN doesn't appear to improve the results visually and yields a lower FCN-score.
  • An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. This allows us to train on, say, 256x256 images and test/sample/generate on 512x512.
  • cGANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks.
  • When semantic segmentation is required (i.e. going from image to label) L1 performs better than cGAN. We argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient.
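
For reference, the combined objective described in the bullets above is, as I read it from the paper (transcribed in LaTeX, using the paper's notation):

```latex
% Conditional GAN term: D sees the input x paired with either the real y
% or the generated G(x, z)
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big]
                         + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

% L1 reconstruction term: keeps the output near the ground truth with less
% blurring than an L2 term would produce
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_1\big]

% Full objective: G fights D on the GAN term while also minimizing the L1 term,
% weighted by lambda
G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
```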
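
To make the PatchGAN idea concrete, here is a minimal sketch of a patch discriminator. This is not the paper's exact 70x70 architecture (that is defined in the released Torch code); the layer count and channel widths here are illustrative, written PyTorch-style:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a PatchGAN-style discriminator: a fully convolutional stack whose
    output is a grid of real/fake logits, one per overlapping input patch, rather
    than a single score for the whole image."""

    def __init__(self, in_channels=6, base_channels=64):
        # in_channels = 6: the conditional discriminator sees the input image and the
        # output image concatenated along the channel axis (3 + 3 for RGB pairs).
        super().__init__()
        self.net = nn.Sequential(
            # Blocks follow the convolution-BatchNorm-(Leaky)ReLU pattern noted above.
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels, base_channels * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base_channels * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels * 2, base_channels * 4, 4, stride=2, padding=1),
            nn.BatchNorm2d(base_channels * 4),
            nn.LeakyReLU(0.2),
            # 1-channel output: each spatial location is a patch-level real/fake logit.
            # Being fully convolutional, the same D can be applied to larger images.
            nn.Conv2d(base_channels * 4, 1, 4, stride=1, padding=1),
        )

    def forward(self, input_image, output_image):
        pair = torch.cat([input_image, output_image], dim=1)  # D observes the (input, output) pair
        patch_logits = self.net(pair)                         # (batch, 1, H', W') grid of patch scores
        return patch_logits.mean(dim=(1, 2, 3))               # average responses -> one logit per image
```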
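
And a minimal sketch of the alternating D/G update, assuming a generator module plus the patch discriminator sketched above (names here are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_optimizer, d_optimizer, x, y, lambda_l1=100.0):
    """One alternating update: a gradient step on D, then a step on G.
    x is a batch of input images, y the corresponding ground-truth outputs.
    Noise z is omitted here; the paper injects stochasticity only via dropout."""

    # --- Discriminator step: push real pairs toward 1 and fake pairs toward 0 ---
    d_optimizer.zero_grad()
    fake = generator(x).detach()                # detach so this step does not update G
    d_real = discriminator(x, y)
    d_fake = discriminator(x, fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    d_optimizer.step()

    # --- Generator step: fool D on the GAN term while staying close to y in L1 ---
    g_optimizer.zero_grad()
    fake = generator(x)
    d_fake = discriminator(x, fake)
    gan_term = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    l1_term = F.l1_loss(fake, y)
    g_loss = gan_term + lambda_l1 * l1_term     # lambda = 100, the weighting discussed in the comments below
    g_loss.backward()
    g_optimizer.step()

    return d_loss.item(), g_loss.item()
```

The paper trains both nets with Adam at learning rate 0.0002 and momentum β1 = 0.5, which here would be e.g. torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999)).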

Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

Misc

  • Least absolute deviations (L1) and least squares error (L2) are the two standard loss functions that decide what should be minimized while learning from a dataset (source); a tiny numerical comparison follows this list.
  • How, using pix2pix, do you specify a loss of L1, L1+GAN, and L1+cGAN?
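
A tiny numerical comparison of the two (plain numpy, not from the repo):

```python
import numpy as np

y_true = np.array([0.0, 1.0, 2.0])
y_pred = np.array([0.5, 1.0, 4.0])

l1 = np.mean(np.abs(y_pred - y_true))     # least absolute deviations: ~0.83
l2 = np.mean((y_pred - y_true) ** 2)      # least squares error:       ~1.42
# L2 punishes the single large error (2 -> 4) far more than L1 does, which is part
# of why a per-pixel L2 loss pushes a generator toward safe, blurry "averages".
```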

Resources

@phillipi

Thanks for writing this up!

About specifying the loss: you can pass the command line parameters use_L1=0 to turn off the L1 loss, condition_GAN=0 to switch from cGAN to GAN, and use_GAN=0 to completely turn off the GAN loss.

For example: use_L1=1 use_GAN=1 condition_GAN=0 th train.lua will train an L1+GAN model.

@brannondorsey

No problem :) It helped me process some of the stuff in the paper, especially because I'm primarily an artist and certainly no statistician/ML researcher. Thanks for specifying the usage of the loss; I ran into it in the source code and assumed its use was somewhat similar to this, but seeing an example is really helpful. I just commented on an issue in the repo asking about errL1. Is that value always going to be the L1 error, or will it be the error of whatever loss function you've chosen, say L1+cGAN?

@phillipi

phillipi commented Dec 3, 2016

errL1 always reports the error of L1. errG and errD report the cGAN error values.

@jedisct1

jedisct1 commented Jun 1, 2017

Just wanted to say thank you for this excellent write-up :)

@ieee8023

Can you post this on ShortScience.org? Or can I?

http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.07004

@nikhilchh

https://github.com/affinelayer/pix2pix-tensorflow/blob/master/pix2pix.py
In the tensorflow implementation I observed -
gen_loss = gen_loss_GAN * 1.0 + gen_loss_L1 * 100.0

Why are the default weights 1 and 100? Why so little weight for the GAN loss?

@mszeto715

Thank you for this very helpful summary!

I was wondering... were you able to train on 256x256 and test on 512x512, or do you know of anyone who has tried it?
