Goodfellow et al. (2014) proposed training two networks against each other: a Generator that produces fake data, and a Discriminator that tries to tell real from fake. The resulting minimax optimization --- Generative Adversarial Networks --- became one of the most influential frameworks in generative modeling.
This post covers the mathematical core: the minimax objective, Jensen-Shannon divergence, the closed-form optimal discriminator, and the Nash equilibrium interpretation.
The Minimax Game
The GAN framework has two players:
- Generator $G$: Maps a noise vector $z \sim p_z(z)$ into the data space. It tries to produce samples the Discriminator cannot distinguish from real data.
- Discriminator $D$: Takes a sample $x$ and outputs $D(x) \in [0, 1]$, the probability that $x$ is real rather than generated.
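They optimize a shared value function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$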
$D$ maximizes this: push $D(x) \to 1$ for real data and $D(G(z)) \to 0$ for fakes. $G$ minimizes it: push $D(G(z)) \to 1$ so that $\log(1 - D(G(z))) \to -\infty$.
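To make the two pulls concrete, here is a minimal sketch that evaluates the value function on a toy problem with three outcomes. It assumes the generator's output distribution $p_g$ is given directly as a probability table instead of being sampled through $G(z)$; the distributions and the helper name `value_function` are made up for illustration.

```python
import math

def value_function(p_data, p_g, d):
    """V(D, G) for distributions over a finite set of outcomes.

    p_data, p_g: lists of probabilities (each sums to 1).
    d: discriminator outputs D(x), each strictly inside (0, 1).
    """
    real_term = sum(p * math.log(dx) for p, dx in zip(p_data, d))
    fake_term = sum(q * math.log(1.0 - dx) for q, dx in zip(p_g, d))
    return real_term + fake_term

# Hypothetical toy distributions over three outcomes.
p_data = [0.7, 0.2, 0.1]
p_g = [0.1, 0.2, 0.7]

uninformed = [0.5, 0.5, 0.5]  # D answers 1/2 everywhere
informed = [0.9, 0.5, 0.1]    # D leans "real" where p_data > p_g

v_uninformed = value_function(p_data, p_g, uninformed)
v_informed = value_function(p_data, p_g, informed)
print(v_uninformed, v_informed)
```

The informed discriminator scores a higher $V$ than the coin-flip one, which is exactly the direction $D$ pushes. $G$ can only lower $V$ by changing `p_g`.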
Jensen-Shannon Divergence
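The minimax game implicitly minimizes the Jensen-Shannon divergence between the real distribution $p_{\text{data}}$ and the generated distribution $p_g$:
$$\text{JSD}(p_{\text{data}} \,\|\, p_g) = \frac{1}{2}\,\text{KL}(p_{\text{data}} \,\|\, m) + \frac{1}{2}\,\text{KL}(p_g \,\|\, m), \qquad m = \frac{p_{\text{data}} + p_g}{2}$$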
JSD has two useful properties that KL divergence lacks: it is symmetric in its arguments, and it is bounded ($0 \leq \text{JSD} \leq \log 2$). $\text{JSD} = 0$ iff $p_g = p_{\text{data}}$. The symmetry matters practically. KL divergence penalizes $G$ differently depending on direction: $\text{KL}(p_{\text{data}} \| p_g)$ blows up when $G$ ignores modes that exist in the data (mode dropping), while $\text{KL}(p_g \| p_{\text{data}})$ blows up when $G$ places mass where the data has none (mode invention). JSD weighs both failure cases the same way, and because it is bounded it stays finite even when the two distributions barely overlap.
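A small numerical check of these properties, as a sketch for discrete distributions only. The helper names `kl` and `jsd` and the example tables are illustrative, and everything is measured in nats.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: p never produces the third outcome, q does.
p = [0.5, 0.5, 0.0]
q = [0.1, 0.1, 0.8]

print("KL(p||q):", kl(p, q))   # finite
print("KL(q||p):", kl(q, p))   # infinite
print("JSD(p,q):", jsd(p, q), "JSD(q,p):", jsd(q, p))

# Completely disjoint supports hit the upper bound log 2.
print("JSD disjoint:", jsd([1.0, 0.0], [0.0, 1.0]), "log 2:", math.log(2))
```

The two KL directions disagree (one is finite, the other infinite), while JSD returns the same finite number either way.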
The Optimal Discriminator
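For a fixed $G$, the optimal $D^*$ has a closed form. Write $V(D, G)$ as a single integral over $x$, take the functional derivative with respect to $D(x)$, and set it to zero:
$$V(D, G) = \int_x \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$$
$$\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0 \quad \Longrightarrow \quad D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$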
So the optimal discriminator is a monotone function of the density ratio $r(x) = p_{\text{data}}(x) / p_g(x)$: $D^*(x) = r(x) / (1 + r(x))$. When $p_g = p_{\text{data}}$, $D^*(x) = \frac{1}{2}$ everywhere, and the discriminator is reduced to a coin flip. This closed-form result is what connects the GAN game to a well-defined divergence minimization problem. Without it, the value function would just be an arbitrary two-player loss with no guarantee that $G$ is learning anything meaningful about the data distribution.
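The closed form can be sanity-checked by brute force. The sketch below fixes the two densities at a single point $x$, scans candidate values of $D(x)$ on a grid, and compares the best one with $p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$. The density values and function names are made up for illustration.

```python
import math

def pointwise_objective(d, p_data_x, p_g_x):
    """Integrand of V at one point x: p_data(x) log D + p_g(x) log(1 - D)."""
    return p_data_x * math.log(d) + p_g_x * math.log(1.0 - d)

def grid_argmax(p_data_x, p_g_x, steps=9999):
    """Brute-force search over D in (0, 1) on an evenly spaced grid."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(candidates, key=lambda d: pointwise_objective(d, p_data_x, p_g_x))

def closed_form(p_data_x, p_g_x):
    """The analytic maximizer of the pointwise objective."""
    return p_data_x / (p_data_x + p_g_x)

# Hypothetical density values (p_data(x), p_g(x)) at three different points x.
cases = [(0.7, 0.1), (0.2, 0.2), (0.05, 0.6)]
for p_data_x, p_g_x in cases:
    print(p_data_x, p_g_x, grid_argmax(p_data_x, p_g_x), closed_form(p_data_x, p_g_x))
```

Because the pointwise objective is concave in $D(x)$, the grid maximizer lands within one grid step of the analytic answer in every case.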
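Substituting $D^*$ back into the value function:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] = -\log 4 + 2\,\text{JSD}(p_{\text{data}} \,\|\, p_g)$$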
This confirms that training $G$ against the optimal $D$ is equivalent to minimizing the JSD between $p_{\text{data}}$ and $p_g$.
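The identity $V(D^*, G) = -\log 4 + 2\,\text{JSD}(p_{\text{data}} \| p_g)$ is easy to verify numerically. This is a sketch for strictly positive discrete distributions; the tables and helper names are illustrative.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in nats for strictly positive distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    kl_pm = sum(pi * math.log(pi / mi) for pi, mi in zip(p, m))
    kl_qm = sum(qi * math.log(qi / mi) for qi, mi in zip(q, m))
    return 0.5 * kl_pm + 0.5 * kl_qm

def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    d_star = [p / (p + q) for p, q in zip(p_data, p_g)]
    real_term = sum(p * math.log(d) for p, d in zip(p_data, d_star))
    fake_term = sum(q * math.log(1.0 - d) for q, d in zip(p_g, d_star))
    return real_term + fake_term

# Hypothetical toy distributions over three outcomes.
p_data = [0.7, 0.2, 0.1]
p_g = [0.1, 0.2, 0.7]

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4.0) + 2.0 * jsd(p_data, p_g)
print(lhs, rhs)

# When the generator matches the data, V(D*, G) bottoms out at -log 4.
print(value_at_optimal_d(p_data, p_data), -math.log(4.0))
```

Both sides agree up to floating-point error, and setting `p_g` equal to `p_data` gives the minimum value $-\log 4$.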
Training Dynamics and Nash Equilibrium
Training alternates between updating $D$ and $G$. In game-theoretic terms, the solution is a Nash equilibrium --- neither player can improve by unilaterally changing strategy.
The equilibrium is $p_g = p_{\text{data}}$, $D(x) = \frac{1}{2}$ everywhere. In practice, reaching it is hard:
- Mode collapse: $G$ learns to produce only a few samples that reliably fool $D$, ignoring the rest of the distribution.
- Vanishing gradients: If $D$ gets too strong too fast, $D(G(z)) \to 0$, the loss $\log(1 - D(G(z)))$ saturates near $0$, and $G$ receives near-zero gradients and stalls.
- Oscillation: The two networks chase each other around the equilibrium without settling.
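The oscillation failure mode in the list above shows up even in a far simpler game. The sketch below is not a GAN: it assumes a toy bilinear objective $f(x, y) = xy$, with $x$ standing in for the minimizing player and $y$ for the maximizing one, and applies alternating gradient steps. The equilibrium is $(0, 0)$; the function name, step size, and starting point are arbitrary choices for illustration.

```python
import math

def alternating_gda(x, y, lr=0.1, steps=1000):
    """Alternating gradient descent-ascent on f(x, y) = x * y.

    x plays the minimizer (like G), y the maximizer (like D).
    Returns the distance from the equilibrium (0, 0) after every step.
    """
    radii = []
    for _ in range(steps):
        x = x - lr * y  # minimizer step: df/dx = y
        y = y + lr * x  # maximizer step, using the updated x: df/dy = x
        radii.append(math.hypot(x, y))
    return radii

radii = alternating_gda(1.0, 1.0)
print("closest approach:", min(radii))
print("farthest point:  ", max(radii))
```

The iterates circle the equilibrium at a roughly constant distance and never close in on it, no matter how many steps are taken.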
A practical fix from the original paper: instead of having $G$ minimize $\log(1 - D(G(z)))$, have it maximize $\log(D(G(z)))$. This gives stronger gradients early on when $D(G(z)) \approx 0$.
The Non-Saturating Loss and Gradient Behavior
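The gradient issue is worth unpacking. Under the original minimax objective, when $D$ is confident that a generated sample is fake ($D(G(z)) \approx 0$), the gradient of $\log(1 - D(G(z)))$ with respect to $G$'s parameters becomes vanishingly small. To see why, assume $D$ ends in a sigmoid, $D = \sigma(a)$ with logit $a$ (the usual setup): the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a) = -D(G(z))$, which goes to zero along with $D(G(z))$. The Generator receives almost no learning signal precisely when it needs the most guidance: early in training, when its outputs look nothing like real data.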
The non-saturating alternative $-\log(D(G(z)))$ fixes this. Its derivative with respect to $D(G(z))$ is $-1 / D(G(z))$, which is large in magnitude when $D(G(z))$ is small. The Generator gets its strongest kick exactly when the Discriminator is most confident about rejecting fakes. Importantly, the equilibrium point is the same (both objectives drive $G$ toward $p_g = p_{\text{data}}$), but the gradient landscape en route to that equilibrium is far more navigable.
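A few numbers make the contrast visible. The sketch below assumes the discriminator's output is a sigmoid of a scalar logit $a$, so $D(G(z)) = \sigma(a)$, and compares the derivative of each generator loss with respect to that logit. The function names are illustrative, and the derivatives are written out by hand instead of using an autodiff library.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)), which simplifies to -sigmoid(a)."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)), which simplifies to -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# Very negative logits mean D is confident the sample is fake.
for logit in [-8.0, -4.0, 0.0, 4.0]:
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  "
          f"saturating={saturating_grad(logit):+.4f}  "
          f"non-saturating={non_saturating_grad(logit):+.4f}")
```

When $D(G(z))$ is near zero, the saturating gradient is also near zero, while the non-saturating gradient has magnitude close to one. Both losses push the logit in the same direction; only the strength of the signal differs.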
This distinction between equivalent optima and different optimization landscapes is a recurring theme in deep learning: the loss you train with is not just about what it converges to, but how it behaves during the journey.
Up Next
In Part 2, we implement these ideas in PyTorch --- a Vanilla GAN with fully-connected layers and a DCGAN with convolutional structure, paying close attention to the architectural details that determine whether training converges or falls apart.