In 2014, Ian Goodfellow introduced a radical idea: instead of training a single network to generate data, train two networks that compete against each other. A Generator that tries to produce fake data, and a Discriminator that tries to distinguish real from fake. This adversarial game, formalized as a minimax optimization, launched the field of Generative Adversarial Networks (GANs) and fundamentally reshaped generative modeling.
In Part 1 of this mini-series, we will deconstruct the mathematical core of GANs: the minimax objective, Jensen-Shannon divergence, the proof of the optimal discriminator, and why training GANs is equivalent to searching for a Nash equilibrium.
The Minimax Game
The GAN framework consists of two players:
- Generator $G$: Takes a random noise vector $z \sim p_z(z)$ and maps it to the data space. Its goal is to produce samples that are indistinguishable from real data.
- Discriminator $D$: Takes a data sample $x$ and outputs a probability $D(x) \in [0, 1]$ that $x$ came from the real data distribution rather than from $G$.
The two networks play a minimax game with the value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The Discriminator wants to maximize this objective: assign high probability to real data ($D(x) \to 1$) and low probability to fake data ($D(G(z)) \to 0$). The Generator wants to minimize it: make $D(G(z)) \to 1$ so that $\log(1 - D(G(z))) \to -\infty$.
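To make the objective concrete, here is a minimal pure-Python sketch of a Monte Carlo estimate of $V(D, G)$ from the Discriminator's outputs on real and fake batches (the helper name `value_function` is ours, not from the paper):

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: D's outputs on real samples; d_fake: D's outputs on generated samples.
    """
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# A confident, accurate discriminator (D(x) near 1 on real, near 0 on fake)
# pushes V up; a completely fooled one (both outputs 0.5) gives V = -log 4.
print(value_function([0.9, 0.9], [0.1, 0.1]))  # high: D is winning
print(value_function([0.5, 0.5], [0.5, 0.5]))  # -log 4: D is guessing
```

Note that $V$ is exactly a (negated) binary cross-entropy: maximizing it over $D$ is the same as training a binary classifier on real-vs-fake labels.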
Jensen-Shannon Divergence
Why does this particular objective work? It turns out that the minimax game implicitly minimizes the Jensen-Shannon divergence between the real data distribution $p_{\text{data}}$ and the generated distribution $p_g$.
The JSD is defined as:

$$\text{JSD}(p_{\text{data}} \,\|\, p_g) = \frac{1}{2}\,\text{KL}\!\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + \frac{1}{2}\,\text{KL}\!\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right)$$
Unlike the KL divergence, JSD is symmetric and bounded: $0 \leq \text{JSD} \leq \log 2$. When $p_g = p_{\text{data}}$, $\text{JSD} = 0$, and the Generator has perfectly learned the data distribution.
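These properties are easy to verify numerically. The sketch below computes the JSD for discrete distributions (with natural log, so the upper bound is $\log 2$); the `kl` and `jsd` helpers are ours:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; 0 * log(0) terms are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, q))                    # symmetric: equals jsd(q, p)
print(jsd(p, p))                    # 0.0 when the distributions match
print(jsd([1.0, 0.0], [0.0, 1.0])) # log 2: disjoint supports hit the upper bound
```

The disjoint-support case is worth remembering: whenever $p_g$ and $p_{\text{data}}$ do not overlap, the JSD saturates at $\log 2$, which foreshadows the vanishing-gradient problems discussed below.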
The Optimal Discriminator
For a fixed Generator $G$, the optimal Discriminator $D^*$ can be derived analytically. Writing the value function as an integral, $V(D, G) = \int_x \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$, and maximizing the integrand pointwise (for $a, b > 0$, the function $a \log y + b \log(1 - y)$ attains its maximum on $[0, 1]$ at $y = \frac{a}{a + b}$) gives:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
This is an elegant result. At optimality, the Discriminator outputs the density ratio between the real and generated distributions. When $p_g = p_{\text{data}}$, we get $D^*(x) = \frac{1}{2}$ everywhere --- the Discriminator is reduced to random guessing.
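A quick sketch of the density-ratio view, using two unit-variance Gaussians as stand-ins for $p_{\text{data}}$ (mean 0) and $p_g$ (mean 2) — the distributions and the `d_star` helper are illustrative choices, not anything from the paper:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def d_star(x, mu_data=0.0, mu_g=2.0):
    """Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    p_data = gauss_pdf(x, mu_data)
    p_g = gauss_pdf(x, mu_g)
    return p_data / (p_data + p_g)

print(d_star(-3.0))  # near 1: far more likely under p_data
print(d_star(5.0))   # near 0: far more likely under p_g
print(d_star(1.0))   # exactly 0.5: the two densities are equal at the midpoint
```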
Substituting $D^*$ back into the value function yields:

$$C(G) = \max_D V(D, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \,\|\, p_g)$$
This proves that training the Generator against the optimal Discriminator is equivalent to minimizing the JSD between $p_{\text{data}}$ and $p_g$.
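The identity $C(G) = -\log 4 + 2 \cdot \text{JSD}$ can be checked numerically on discrete distributions. The sketch below evaluates $V(D^*, G)$ directly and compares it to the JSD formula (the toy distributions are arbitrary):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; 0 * log(0) terms are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.7, 0.2, 0.1]
p_g    = [0.1, 0.3, 0.6]

# V(D*, G): plug the optimal discriminator D*(x) = p_data / (p_data + p_g)
# into both expectations of the value function.
v_opt = sum(pd * math.log(pd / (pd + pg)) for pd, pg in zip(p_data, p_g) if pd > 0) \
      + sum(pg * math.log(pg / (pd + pg)) for pd, pg in zip(p_data, p_g) if pg > 0)

m = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(v_opt)                      # equals -log 4 + 2 * JSD, term by term
print(-math.log(4) + 2 * jsd)
```

The agreement is exact, not approximate: each $\log \frac{p}{p + q}$ term in $V(D^*, G)$ is $\log \frac{p}{2m} = \log \frac{p}{m} - \log 2$, which is where the $-\log 4$ constant comes from.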
Training Dynamics and Nash Equilibrium
GAN training alternates between updating $D$ and updating $G$. In game theory, the solution to a minimax game is a Nash equilibrium --- a state where neither player can improve by unilaterally changing their strategy.
For GANs, the Nash equilibrium occurs when $p_g = p_{\text{data}}$ and $D(x) = \frac{1}{2}$ for all $x$. However, reaching this equilibrium is notoriously difficult in practice:
- Mode collapse: The Generator learns to produce only a few high-confidence samples rather than covering the full data distribution.
- Training instability: If the Discriminator becomes too strong too quickly, $D(G(z)) \to 0$ and the Generator's loss term $\log(1 - D(G(z)))$ saturates near $0$; its gradient vanishes and the Generator cannot learn.
- Oscillation: The two networks may chase each other around the equilibrium without converging.
In practice, Goodfellow proposed a modification: instead of minimizing $\log(1 - D(G(z)))$, the Generator maximizes $\log(D(G(z)))$. This provides stronger gradients early in training when $D(G(z)) \approx 0$.
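The effect of this non-saturating trick is visible directly in the derivatives with respect to $d = D(G(z))$, as this small sketch shows:

```python
# Gradients of the two generator losses w.r.t. d = D(G(z)):
#   saturating loss      log(1 - d):  d/dd = -1 / (1 - d)
#   non-saturating loss  -log(d):     d/dd = -1 / d
d = 0.001  # early training: the Discriminator easily rejects fake samples

grad_saturating = -1.0 / (1.0 - d)  # about -1: weak learning signal
grad_nonsaturating = -1.0 / d       # -1000: strong learning signal

print(grad_saturating, grad_nonsaturating)
```

Both losses point the Generator in the same direction (increase $D(G(z))$), but the non-saturating version delivers large gradients exactly in the regime where the saturating version stalls.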
Up Next
With the mathematical foundations laid, we understand what a GAN is optimizing and why it works. In Part 2, we will implement these ideas in pure PyTorch --- building both a Vanilla GAN and a Deep Convolutional GAN (DCGAN) from scratch, with careful attention to the architectural choices that make training stable.