Generative Adversarial Networks (GANs)
GAN stands for Generative Adversarial Network. If you are not already familiar with GANs, I guess that doesn’t really help you, does it? To make it short, GANs are a class of machine learning systems, more precisely a deep neural network architecture (you know, these artificial “intelligence” things) that is very effective at generating… stuff! Statistically speaking, GANs can in principle learn to mimic any distribution of data, thus producing convincing (but still fake) data points from the learnt distribution.
They are especially good at generating images, but they could in fact be used to generate anything, for example 3d models, text or even music. But what makes GANs so interesting is the general idea behind them, both simple and smart at the same time.
GAN: A simple but smart idea
When introducing GANs to someone who has never heard of them, I really like the analogy of the counterfeiter and the cop. The counterfeiter is learning how to create fake money, and the cop tries to spot any fake money. At the beginning, they are both beginners with no experience: the counterfeiter produces very poorly imitated money, and the cop is incapable of telling fake from real money with good accuracy.
However, both of them are learning and improving at their tasks. The cop gets better at telling fake from real money, so the counterfeiter also has to get better at producing convincing fake money in order to fool the cop. Eventually, the counterfeiter will improve so much that he is able to create almost perfect-looking fake money, which the cop can no longer distinguish from real money.
As you can see, the idea behind GANs is rather simple: let’s have two mathematical models (artificial neural networks in practice) compete against each other in order to have them both improve over time without supervision. The first one, called the generator (the counterfeiter in our previous example), produces some data from only pure random inputs. The second one, called the discriminator (the cop in our example), will be shown both real data (from a real dataset) and fake data (obtained from the generator), and tells if it is real or fake data.
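To make the two roles a bit more concrete, here is a minimal sketch of what each player tries to minimize. It assumes the usual binary cross-entropy formulation with the commonly used "non-saturating" generator loss; the function names and the toy probabilities are purely illustrative and do not come from any specific implementation.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for a single predicted probability in (0, 1)."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))

def discriminator_loss(d_real, d_fake):
    # The cop wants to output 1 on real samples and 0 on fake ones.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # The counterfeiter wants the cop to output 1 on fake samples.
    return bce(d_fake, 1.0)

# Toy probabilities produced by a hypothetical discriminator.
confident_cop = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_cop = discriminator_loss(d_real=0.5, d_fake=0.5)
```

A cop that separates real from fake well has a low loss, while a cop that answers 0.5 everywhere has a higher one; conversely, the generator loss goes down as the discriminator gets fooled.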
GANs in practice
In practice, you only need a large enough set of samples from your target data distribution and some knowledge of how to train conventional artificial neural networks.
There are numerous articles in the wild about GANs, how they work, the maths behind them, which architecture to choose and how to implement them using PyTorch, TensorFlow or any other machine learning framework. Instead of writing one more article about the basics behind GANs, here are some resources that you might find useful if you want to know how to do your own GAN implementation:
- The original paper by Ian Goodfellow et al., and his video tutorial from NIPS 2016
- A very good introduction to GANs. The same author has many more articles on GANs, such as ways to improve their training.
- Some tips about GAN architectures
- Some other tips and tricks for GAN training
- PyTorch implementations of GANs and multiple derivatives. Very useful if you don’t know where to start (but be sure to try implementing your own for practice!)
- Avoiding checkerboard artifacts by replacing transpose convolution
After reading all these, you might have everything necessary to start designing and training your very first GAN. Now all that you need is to find an application that motivates you!
Some cool applications of GANs (and their derivatives)
GANs have many derivatives, and researchers in the machine learning field love to give “cute” little names to their own neural network architectures, like BEGAN, DCGAN, CycleGAN, GTPK-UP-GAN-HD or whatever they found inspiring. They all revolve around the same original principle of GANs, but add very nice variations to perform specific tasks or improve training (which is the hard part with GANs).
Here are some direct applications of GANs or their derivatives:
Creating realistic faces
This might be the best-known application of GANs: generating fake celebrity faces with impressive realism.
If you are curious and have some time to spend, you should have a look at thispersondoesnotexist.com: each time you refresh the web page, you will be shown a new generated face.
I never get tired of looking at the result when interpolating in the latent space (i.e. the underlying representation used as input by the generator, generally around a hundred values). I especially like seeing these glasses pop out of nowhere and then disappear again…
Such an application also raises many concerns about fake content generation on the Internet. What if we were able to create millions of fake profiles indistinguishable from real ones?
Making anime faces
GANs can not only generate realistic faces but also stylized ones! Here, researchers used hundreds of thousands of images of anime characters to train a GAN to generate anime faces… with great success!
There is even a demo that you can try online at make.girls.moe. They used metadata associated with each image to label them in order to be able to generate faces with specific user-requested features, such as hair color, glasses, etc.
Changing the pose of a person
Given an image of a person in some pose and only the joint points of the desired new pose, researchers were able to generate an image of the same person in the requested pose:
As with face generation, we should be aware of potential misuses, especially for image or video manipulation. For now however, there are still some hints that can be used to discriminate fake from real images.
Turning horses into zebras and vice-versa
I have no idea who had this idea first, but the results are quite interesting:
This derivative of GANs is called a CycleGAN, as it performs a cyclic transformation using two generators (one that generates zebras from horses, and the other horses from zebras) and two discriminators (one that discriminates horses and the other zebras). Some guy even tried a face <-> ramen transformation!
High-resolution video synthesis from semantic data
This might be one of the most impressive applications so far: generating high-resolution video from only semantic label maps. That is to say, they generated high-resolution images knowing only that one area contained vegetation, another the road, another a car, etc. See for yourself:
There are numerous other impressive applications of GANs, such as 3D object generation or image super-resolution, but I will stop here since this is not the topic of this article! If you are looking for more, you can check this article. Anyway, now that you have seen what GANs are capable of, let’s dive into our subject!
Generating 2D map tiles
As a first experiment with GANs, I chose a simple task (though not so simple in fact, as we will see later): generating small 2D map tiles.
2D map tiles dataset
If you are not familiar with 2d map tiles, here are some samples from my dataset:
As you can see, they are small pictures that you can compose together to create vast maps, generally used in video games. They often have a colored background to encode transparent pixels (we will not work with transparency for now, but I might look into it later). In my dataset, most backgrounds were filled with a blue color, but some others were filled with red, white, green or pink. We could have preprocessed all the tiles to have the same background color, but I wanted to keep the original tiles for now.
Tiles are generally packed as tilesets regrouping similar tiles or tiles made for a specific environment, which makes a lot more sense than when viewed individually:
Although they do not necessarily have to be squares (they could even be hexagonal), I chose to consider only square tiles with a size of 32 pixels because this is the most common format. Indeed, this tile format was made popular by a famous game-making program that has had its days of glory: RPG Maker XP. It is very easy to find dozens of tilesets made for this software. In total, I obtained about 30,500 individual tiles.
Unfortunately, since I don’t have any copyright information about these tilesets, I chose not to publicly release my dataset. You can however recreate a similar one very quickly by searching for tilesets in Google Images, and I even uploaded the Processing script I used to split tilesets into individual tiles on the GitHub page of this article. This script also makes sure that empty tiles (which are often present in tilesets) are discarded.
This seems like a simple task given the results achieved on generating realistic faces or anime faces. However, it is actually a lot more challenging than it might look:
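If you don’t feel like running Processing, the splitting step is easy to reproduce. The following is only a rough Python sketch of the idea, not a port of my script: it assumes the tileset is already loaded as a list of rows of hashable pixel values (for instance RGB tuples), and it assumes that an "empty" tile is simply one made of a single color.

```python
TILE_SIZE = 32  # tile edge length in pixels, as used in this article

def split_tileset(pixels, tile_size=TILE_SIZE):
    """Split a tileset (list of rows of pixel values) into square tiles.

    Tiles made of a single color are treated as empty and discarded.
    Partial tiles on the right and bottom edges are ignored.
    """
    height, width = len(pixels), len(pixels[0])
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size] for row in pixels[top:top + tile_size]]
            colors = {value for row in tile for value in row}
            if len(colors) > 1:  # keep only tiles that contain some drawing
                tiles.append(tile)
    return tiles

# Toy 64x64 "tileset": uniform background with a single drawn pixel.
toy_tileset = [[0] * 64 for _ in range(64)]
toy_tileset[40][5] = 1
toy_tiles = split_tileset(toy_tileset)
```

In this toy example only the bottom-left tile contains something other than background, so it is the only one kept.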
- The dataset is very heterogeneous: tiles are actually very different from one to another, while faces or other commonly used datasets mostly look the same.
- This is stylized pixel art: in pixel-art, each pixel counts and has an important impact on the final image. Reproducing this is quite challenging.
- We have few samples: while 30,500 tiles might seem a lot, given the variety of the tiles it is in fact limiting.
- The data is not labeled: when dealing with diversified data, it generally helps to provide a class label (for instance “rock” or “wood” encoded as binary values) as additional input of the GAN to improve its training. Here I had no such label.
- I have a poor laptop GPU: training GANs on a CPU sounds like pure madness today, so you need a decent GPU on which to iterate your training algorithm. Mine was not specifically designed for computation, is several years old and has quite limited memory. It usually took about a day to get the first interesting results, which was quite limiting when optimizing hyper-parameters.
- These are GANs: training GANs is hard and requires a lot of fine-tuning of hyper-parameters. This takes a lot of time, which I don’t have. So there will be room for improvement!
As you can see, most challenges come from the diversity of the tiles. One simple solution could have been to only keep some specific classes of tiles, such as only texture tiles, or only vegetation ones, etc. But I wanted a difficult and challenging task to observe the limitations of GANs.
GAN architecture: (Deep) Convolutional
Nothing too fancy here. I used a classic Deep Convolutional Generative Adversarial Network (DCGAN), although it is not so deep given our sample size (32 pixels). Some details regarding my implementation, though:
- For the very first expanding layer of my generator, I used a transpose convolution layer instead of a reshape or fully connected one. Because I can. And also because it is more or less the same.
- My architecture was made with pictures of size being a power of 2 in mind. That is why you will see no fully connected layers at the top or bottom of my generator and discriminator.
- Most of my convolution layers basically double the number of filters and divide by 2 the size of their input. The opposite was done for transpose convolutions.
- The number of intermediate convolution or transpose convolution layers is computed from the size of the image so that no expansion is performed. No special reason for this other than simplicity.
- I did not fine-tune the architecture. Feel free to try it and change it your own way!
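To illustrate how the number of layers can follow from the image size, here is a small sketch of a layer "plan" for the discriminator. Everything specific in it is an assumption made for the example rather than a description of my actual code: the hypothetical `base_filters=64`, the choice of halving all the way down to a 1x1 output, and the single output channel of the last layer.

```python
def discriminator_plan(image_size=32, base_filters=64):
    """Return (input_size, output_size, output_channels) for each strided convolution."""
    if image_size < 2 or image_size & (image_size - 1):
        raise ValueError("image_size must be a power of 2")
    n_layers = image_size.bit_length() - 1  # log2(image_size) halvings
    plan, size, filters = [], image_size, base_filters
    for layer in range(n_layers):
        is_last = layer == n_layers - 1
        # Assumption: the last layer outputs one real/fake score, so no
        # fully connected layer is needed on top.
        out_channels = 1 if is_last else filters
        plan.append((size, size // 2, out_channels))
        size //= 2
        filters *= 2  # double the filters each time the size is halved
    return plan

plan_32 = discriminator_plan(32)
```

For 32-pixel tiles this gives five layers (32 → 16 → 8 → 4 → 2 → 1); a generator plan would simply be the mirror image, with transpose convolutions (or upsampling) doubling the size at each step.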
Training GANs is hard (have I already said that before?). It requires a lot of fine-tuning, it is mostly empirical and relies on a wide assortment of clever and not-so-clever tricks.
Compared to other deep neural network architectures, GAN training suffers a lot from:
- Non-convergence: the model never converges, the parameters start to oscillate and the training becomes unstable.
- Mode collapse: the generator produces only a few modes, i.e. always generates the same kind of data without representing the full diversity of the data distribution.
- Vanishing gradient: the discriminator loss rapidly tends toward zero (i.e. it becomes good at discriminating too fast), causing the generator’s gradient to vanish so that it learns nothing.
I won’t detail the basic training process of a GAN, i.e. the losses and optimization methods used, as it is explained in all the other resources available about GANs (see the links I provided above). I will however try to briefly introduce each trick I applied for this specific application, and if possible explain why I chose to use it. You can also just look at the source code and see for yourself.
Disclaimer #1: These tricks are not general guidelines: what has worked for me might well not work for you, but it might be worth a try…
Disclaimer #2: These tricks are not sorted by any order of importance.
Trick #1: Use normal distribution to sample latent space
OK, this is not really a trick, everybody does that, right? Well actually, no. You could also choose to sample from a uniform distribution and expect similar results, since theoretically you could map one distribution to another (or at least approximate it). However, it does make a difference in practice and everyone seems to agree on that. Strangely enough, it is hard to find theoretical evidence on why this is a better choice. This answer might be a good starting point if you want to dig deeper into this.
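For reference, here is what the two sampling options look like in a framework-agnostic sketch. The latent size of 100 and the [-1, 1] range of the uniform variant are assumptions chosen for illustration; in PyTorch or TensorFlow you would use the framework’s own random tensor functions instead.

```python
import random

LATENT_DIM = 100  # assumed latent size, for illustration only

def sample_latent_normal(dim=LATENT_DIM, rng=random):
    """Latent vector with independent standard normal components."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def sample_latent_uniform(dim=LATENT_DIM, rng=random):
    """Alternative: independent components uniform in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

rng = random.Random(0)  # seeded generator so the example is repeatable
z_normal = sample_latent_normal(rng=rng)
z_uniform = sample_latent_uniform(rng=rng)
```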
Trick #2: Mini-batch size
Training artificial neural networks is generally done using mini-batches (i.e. the network is updated using the averaged gradient computed on a set of samples), as it is known to improve training. Indeed, feeding samples one-by-one means that the model will update many times before seeing the whole data, which could lead to slow or noisy training. We might also consider training our network on a full batch (i.e. the whole dataset at once). However, this could result in the network being stuck in a suboptimal local optimum.
Using mini-batches is a good compromise that injects enough noise in each gradient update to help escape local optima, while still achieving relatively fast convergence. Memory limitation is often presented as being the motivation for mini-batches (i.e. the whole dataset of millions of images could not fit entirely in memory), but it is not the actual reason since the gradient could be accumulated in several passes. Mini-batches also help to fight against mode collapse since the generated data presented to the discriminator is diversified.
Now, how to choose the mini-batch size? As with the other hyper-parameters, there is probably an optimal size for a specific application. I have seen many people using weird heuristics based on dataset size. Well, you need a way to choose a size, so why not this one? Ideally, I believe that each mini-batch should be as representative of the whole dataset as possible. This means that if you have 10 different classes of data samples, each mini-batch should ideally contain at least one instance of each class, and even better the same number of instances for each class.
In practice this is hard to achieve. My personal way is to look at the data and get a feeling for its diversity (if it is not labeled) and then pick a batch size so that, given the number of samples and the number of classes, chances are high that all classes will be present in each batch (did you expect an actual formula?).
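For what it is worth, under strong simplifying assumptions the "chances are high" heuristic can be turned into a number. The sketch below assumes perfectly balanced classes and independent sampling with replacement (neither of which holds for a real tile dataset), and uses the classic inclusion-exclusion formula for the probability that every class shows up at least once in a batch.

```python
from math import comb

def prob_all_classes_present(n_classes, batch_size):
    """P(every class appears at least once) for balanced classes and i.i.d. draws.

    Inclusion-exclusion over the set of classes missing from the batch.
    """
    return sum(
        (-1) ** j * comb(n_classes, j) * ((n_classes - j) / n_classes) ** batch_size
        for j in range(n_classes + 1)
    )

p_small_batch = prob_all_classes_present(n_classes=10, batch_size=10)
p_large_batch = prob_all_classes_present(n_classes=10, batch_size=128)
```

With 10 balanced classes, a batch of 10 samples almost never contains all of them, while a batch of 128 almost always does. With more classes or unbalanced ones the required batch size grows quickly, which is one way to rationalize picking a large batch for a very diverse dataset.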
Since the diversity of our dataset is high, I wanted the mini-batch size to be quite high as well. In practice, I chose a mini-batch size of 128 samples, especially because I wanted to update my gradient in one pass and could not fit more samples in video memory (while again, this is not really a true issue).
Side note: there is an interesting relation between learning rate and batch-size. You might want to consider training your neural network sample-by-sample with a slowly decaying learning rate. In practice however, it seems better to increase the batch size rather than decreasing the learning rate.
Trick #3: Avoid using sparse gradients
Especially for the generator. Most implementations use Rectified Linear units (ReLU) in the generator, which have a null gradient for values below zero. That could prevent a unit from actually learning anything. Here I preferred using leaky Rectified Linear units (LeakyReLU) with a leaky coefficient of 0.1 to have a non-null gradient everywhere.
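The difference between the two activations is easiest to see on their gradients. This is a plain-Python sketch for scalar inputs, using the 0.1 coefficient mentioned above; a real model would of course use the framework’s built-in LeakyReLU layer.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Zero gradient for negative inputs: the unit receives no learning signal.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.1):
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.1):
    # Small but non-zero gradient for negative inputs.
    return 1.0 if x > 0 else negative_slope

negative_input = -2.0
gradients = (relu_grad(negative_input), leaky_relu_grad(negative_input))
```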
Trick #4: Use DropOut
For the generator and the discriminator. DropOut consists in… well… dropping out… a certain percentage of units (hidden or visible), randomly chosen at each iteration. As if they were never there to begin with! This is used as a regularization method to prevent overfitting and encourage robustness through redundancy. Here I used DropOut with a probability of about 25%.
“Fun” fact: it is actually patented by Google.
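As an illustration, here is a minimal sketch of DropOut on a list of activations with the 25% probability mentioned above. It uses the "inverted" variant, where the kept units are rescaled during training so that nothing has to change at inference time (this is also what PyTorch’s `nn.Dropout` does); the function itself is just an example, not code from my implementation.

```python
import random

def dropout(activations, p=0.25, training=True, rng=random):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)  # DropOut is disabled at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # seeded generator so the example is repeatable
dropped = dropout([1.0] * 1000, p=0.25, rng=rng)
fraction_dropped = dropped.count(0.0) / len(dropped)
```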
Trick #5: Use batch normalization
But change it for instance normalization in the generator if the generated samples seem too correlated. Indeed, batch normalization introduces an inter-sample dependency since each sample is normalized with respect to the mini-batch it belongs to. Batch normalization generally helps with mode collapse and poor network initialization. I personally did not have to use instance normalization, most probably because my batch size was relatively high.
Trick #6: Replace transpose convolution
With upsampling and standard convolution. Indeed, if the stride of a transpose convolution does not divide its kernel size, then you are likely to observe checkerboard artifacts. This is very well explained in this article, with interactive illustrations!
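The divisibility argument can be checked with a few lines of code. This sketch counts, for a 1D transpose convolution, how many kernel taps contribute to each output position; it is only a counting exercise under the simplest setting (no padding, one dimension), not an actual convolution.

```python
def overlap_counts(n_inputs, kernel_size, stride):
    """Number of kernel taps written to each output position (1D, no padding)."""
    length = (n_inputs - 1) * stride + kernel_size
    counts = [0] * length
    for i in range(n_inputs):
        for k in range(kernel_size):
            counts[i * stride + k] += 1
    return counts

def interior(counts, kernel_size):
    """Drop the border positions, which are always covered less often."""
    return counts[kernel_size - 1:len(counts) - kernel_size + 1]

# Kernel size 3 is not divisible by stride 2: uneven overlap.
uneven = interior(overlap_counts(8, kernel_size=3, stride=2), kernel_size=3)
# Kernel size 4 is divisible by stride 2: every position is covered equally.
even = interior(overlap_counts(8, kernel_size=4, stride=2), kernel_size=4)
```

In the first case the coverage alternates between 2 and 1 from one output position to the next, which is exactly the kind of periodic pattern that shows up as a checkerboard in 2D. In the second case every interior position is covered twice.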
Trick #7: Randomly soften your labels
Instead of providing label 0 for fakes and 1 for real samples when training the discriminator, pass it a random value between 0 and 0.1 for fakes, and one between 0.9 and 1 for real. Randomness leads to noise. Noise leads to no overfitting. No overfitting leads to good generalization. This is my “Fear leads to anger. Anger leads to hate. Hate leads to suffering” of machine learning.
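In code, this amounts to replacing the constant target vectors with random ones. A minimal sketch, with a hypothetical helper name and plain Python lists standing in for tensors:

```python
import random

def soft_labels(batch_size, real, rng=random):
    """Random discriminator targets: [0.9, 1] for real samples, [0, 0.1] for fakes."""
    low, high = (0.9, 1.0) if real else (0.0, 0.1)
    return [rng.uniform(low, high) for _ in range(batch_size)]

rng = random.Random(0)  # seeded generator so the example is repeatable
real_targets = soft_labels(128, real=True, rng=rng)
fake_targets = soft_labels(128, real=False, rng=rng)
```

These targets are only used for the discriminator update; when training the generator, the target for its fakes is the "real" label.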
Trick #8: Never forget
Or not too much. During training, feed the discriminator with both recent and past generated images. This prevents the discriminator from being too greedy in defeating what the generator is currently generating. In practice, this is achieved by keeping a buffer of previously generated samples and occasionally swapping them with newly generated ones. Here I kept a buffer twice the size of my mini-batch with a 50% swap probability.
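Here is one possible way to write such a buffer, as a sketch only. The class name, the fill-first policy and the exact swapping rule are assumptions made for the example (they follow the image pool commonly seen in CycleGAN-style code); the samples can be any objects, for instance generated image tensors.

```python
import random

class SampleHistory:
    """Buffer of past generated samples, mixed into the discriminator's fake batch."""

    def __init__(self, capacity, swap_probability=0.5, rng=random):
        self.capacity = capacity
        self.swap_probability = swap_probability
        self.rng = rng
        self.buffer = []

    def mix(self, new_samples):
        """Return a batch where some new samples are replaced by older ones."""
        mixed = []
        for sample in new_samples:
            if len(self.buffer) < self.capacity:
                # Buffer not full yet: store the sample and use it as-is.
                self.buffer.append(sample)
                mixed.append(sample)
            elif self.rng.random() < self.swap_probability:
                # Swap: show an old sample, remember the new one instead.
                index = self.rng.randrange(self.capacity)
                mixed.append(self.buffer[index])
                self.buffer[index] = sample
            else:
                mixed.append(sample)
        return mixed

batch_size = 4  # tiny batch for the example; the article uses 128
history = SampleHistory(capacity=2 * batch_size, rng=random.Random(0))
first = history.mix([1, 2, 3, 4])
second = history.mix([5, 6, 7, 8])
third = history.mix([9, 10, 11, 12])
```

The first two batches only fill the buffer and come back unchanged; from the third batch on, each new sample has a 50% chance of being traded for an older one.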
Trick #9: Use different latent space samples
When training the discriminator and the generator. Just sample new latent vectors when training your second model. This helps fight against mode collapse since both models are not updated on the same data.
Probably the part you are waiting for…
Generated 2D tiles samples
Without further introduction, here are some samples generated after about 300 epochs (and 2 days of screaming GPU fans):
So where to start? First, I think this looks pretty good or at least pretty promising. The model was capable of capturing and producing results with some fine details: a wood sign, various brick walls, various wood floors, roof tiles, some vegetation, some transparency patterns often used in 2d tiles (such as the first corner), a stick or whatever this line is, an unidentified object (4th row, 4th column) etc.
Of course, this is far from being perfect: this is not very stylized (look at the vegetation for example), and my dataset contained tons of individual objects such as barrels that the GAN was unable to reproduce. There is still room for improvement! This was expected given the difficulty of the task.
Actually this was the result I got after training my first GAN version on the same dataset, for about 400 epochs:
Oh yeah, using the latest cutting-edge machine learning techniques to generate… a color palette. What a disappointment! But it wasn’t over yet. This was my second attempt, after about 400 epochs:
And that, folks, is what mode collapse looks like. We can clearly see that this new version was starting to capture some fine details of the original images, but it kept generating the same patterns. The third one (or maybe fourth? I don’t remember…) was the good one. Or at least the best of those.
I am not ashamed to admit it: the final examples I showed above were handpicked. Here you can see how the generated samples evolved over the epochs. These samples are all randomly picked and unfortunately vary from one epoch to another (next time I will save the intermediate models, I promise!):
Latent space interpolation
It’s always captivating to take two random latent vectors and interpolate between them to see how the generated samples transition from one to the other. This gives some insights into the underlying representation learnt by the model. But here, we will just do it for fun. To interpolate between latent vectors, I used spherical interpolation since our samples were taken from a normal distribution. In the following image, each row corresponds to one interpolation from the sample at the left to the one at the right:
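For the curious, spherical interpolation (slerp) between two latent vectors can be sketched in a few lines. This is a plain-Python version on lists, written for illustration rather than copied from my code; it falls back to linear interpolation when the two vectors are almost parallel, and it is numerically unreliable when they point in nearly opposite directions.

```python
import math

def slerp(a, b, t):
    """Spherical interpolation between vectors a and b for t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Angle between the two vectors, clamped against rounding errors.
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    if omega < 1e-6:
        # Nearly parallel vectors: plain linear interpolation is fine.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

start, end = [1.0, 0.0], [0.0, 1.0]
midpoint = slerp(start, end, 0.5)
steps = [slerp(start, end, i / 8) for i in range(9)]
```

Unlike linear interpolation, the midpoint between two unit vectors stays on the unit circle instead of cutting through the middle, so the intermediate latent vectors keep a norm that is typical of what the generator saw during training.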
That’s all for now, don’t hesitate to give me some feedback or suggestions in the comment section!
I might give this another try if I have enough free time (which I don’t, let’s face it…), especially trying to play with transparency and slightly different architectures. But before that I might look into implementing conjugate gradient optimization for PyTorch, since I had very good experience with it in the past (although I seem to be the only one using it for deep learning… why???). I also have some ideas about texture generation that I am eager to try. Stay tuned!