Deconstructing GANs from Scratch

Part 1: The Math of Adversarial Training

Introduction

Goodfellow et al. (2014) proposed training two networks against each other: a Generator that produces fake data, and a Discriminator that tries to tell real from fake. The resulting framework, Generative Adversarial Networks (GANs), is trained by minimax optimization and became one of the most influential approaches in generative modeling.

This post covers the mathematical core: the minimax objective, Jensen-Shannon divergence, the closed-form optimal discriminator, and the Nash equilibrium interpretation.

The Minimax Game

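The GAN framework has two players: a Generator $G$, which maps noise $z \sim p_z$ to a sample $G(z)$, and a Discriminator $D$, which outputs the probability $D(x)$ that $x$ is real rather than generated.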

They optimize a shared value function:

$$ \min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$

$D$ maximizes this: push $D(x) \to 1$ for real data and $D(G(z)) \to 0$ for fakes. $G$ minimizes it: push $D(G(z)) \to 1$ so that $\log(1 - D(G(z))) \to -\infty$.
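To make the value function concrete, here is a minimal sketch that evaluates $V(D, G)$ exactly on a toy discrete problem. The three-outcome distributions `p_data` and `p_g` and the helper name `value_fn` are illustrative assumptions, not something from the original paper; on a finite support both expectations reduce to weighted sums.

```python
import math

def value_fn(d, p_data, p_g):
    """V(D, G) on a finite support: E_data[log D] + E_g[log(1 - D)]."""
    real_term = sum(p * math.log(dx) for p, dx in zip(p_data, d))
    fake_term = sum(q * math.log(1.0 - dx) for q, dx in zip(p_g, d))
    return real_term + fake_term

# Hypothetical toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

# An uninformative discriminator outputs 1/2 everywhere, giving V = -log 4.
v_blind = value_fn([0.5, 0.5, 0.5], p_data, p_g)

# A discriminator that leans toward "real" where p_data > p_g scores higher.
v_better = value_fn([0.7, 0.5, 0.3], p_data, p_g)
```

Comparing `v_better` with `v_blind` shows what $D$ is rewarded for: assigning higher probability where real data is more likely than generated data.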

Jensen-Shannon Divergence

The minimax game implicitly minimizes the Jensen-Shannon divergence between the real distribution $p_{\text{data}}$ and the generated distribution $p_g$:

$$ \text{JSD}(p_{\text{data}} \| p_g) = \frac{1}{2} D_{\text{KL}}\left(p_{\text{data}} \;\middle\|\; \frac{p_{\text{data}} + p_g}{2}\right) + \frac{1}{2} D_{\text{KL}}\left(p_g \;\middle\|\; \frac{p_{\text{data}} + p_g}{2}\right) $$

JSD has two useful properties that KL divergence lacks: it is symmetric, and it is bounded above ($0 \leq \text{JSD} \leq \log 2$ with natural logarithms). $\text{JSD} = 0$ iff $p_g = p_{\text{data}}$. The symmetry matters practically: KL divergence penalizes $G$ differently depending on whether it places mass where the data has none (mode invention) or ignores modes that exist in the data (mode dropping). JSD is unchanged when $p_{\text{data}}$ and $p_g$ are swapped, so it does not favor one direction of comparison over the other.
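Both properties are easy to check numerically. The sketch below uses two made-up discrete distributions `p` and `q`; the helper names `kl` and `jsd` are chosen here for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq, kl_qp = kl(p, q), kl(q, p)      # differ: KL is not symmetric
jsd_pq, jsd_qp = jsd(p, q), jsd(q, p)  # equal: JSD is symmetric

# Disjoint supports: KL would be infinite, while JSD reaches its maximum log 2.
jsd_disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

Note that `jsd` never divides by zero: the mixture `m` is positive wherever either input is positive.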

The Optimal Discriminator

For a fixed $G$, the optimal $D^*$ has a closed form. Take the functional derivative of $V(D, G)$ with respect to $D(x)$ and set it to zero:

$$ D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} $$

So the optimal discriminator outputs the share of the total density at $x$ that comes from the data, which is a monotone function of the density ratio $p_{\text{data}}(x) / p_g(x)$. When $p_g = p_{\text{data}}$, $D^*(x) = \frac{1}{2}$ everywhere: the discriminator is reduced to a coin flip. This closed-form result is what connects the GAN game to a well-defined divergence minimization problem. Without it, the value function would just be an arbitrary two-player loss with no guarantee that $G$ is learning anything meaningful about the data distribution.
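The functional-derivative step can be spelled out pointwise. As a sketch, assuming $p_g$ has a density so that the expectation over $z$ can be rewritten as an integral over $x = G(z)$:

$$ V(D, G) = \int \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx $$

For a fixed $x$, write $a = p_{\text{data}}(x)$, $b = p_g(x)$, and $y = D(x)$ (shorthand introduced here for the derivation). The integrand $f(y) = a \log y + b \log(1 - y)$ satisfies

$$ f'(y) = \frac{a}{y} - \frac{b}{1 - y} = 0 \quad \Longrightarrow \quad y = \frac{a}{a + b}, \qquad f''(y) = -\frac{a}{y^2} - \frac{b}{(1 - y)^2} < 0 $$

so for $a, b > 0$ the stationary point is the unique maximum on $(0, 1)$. Maximizing the integrand at every $x$ maximizes the integral, which recovers the formula for $D^*(x)$.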

Substituting $D^*$ back into the value function:

$$ V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g) $$

This confirms that training $G$ against the optimal $D$ is equivalent to minimizing the JSD between $p_{\text{data}}$ and $p_g$.
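The identity can be sanity-checked on a small discrete example. This is a sketch under assumed toy distributions; `value_fn`, `kl`, and `jsd` are helper names introduced here, and sums stand in for the expectations.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_fn(d, p_data, p_g):
    """V(D, G) on a finite support."""
    return sum(p * math.log(dx) + q * math.log(1.0 - dx)
               for p, q, dx in zip(p_data, p_g, d))

# Hypothetical toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

# Closed-form optimal discriminator for this fixed generator.
d_star = [p / (p + q) for p, q in zip(p_data, p_g)]

lhs = value_fn(d_star, p_data, p_g)         # V(D*, G)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)   # -log 4 + 2 JSD

# Moving away from D* should lower V, since D* is the maximizer.
d_nudged = [min(0.99, dx + 0.05) for dx in d_star]
v_nudged = value_fn(d_nudged, p_data, p_g)
```

Here `lhs` and `rhs` agree up to floating-point error, and `v_nudged` falls below `lhs`.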

Training Dynamics and Nash Equilibrium

Training alternates between updating $D$ and $G$. In game-theoretic terms, the solution is a Nash equilibrium --- neither player can improve by unilaterally changing strategy.
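The alternation can be sketched without any deep learning library. The toy below is an assumption-laden stand-in for real networks: the generator is just a softmax over three categories, the discriminator is one sigmoid logit per category, and the gradients are written out by hand. The function name `train`, the step sizes, and the target distribution are all illustrative choices.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def train(p_data, steps=2000, lr_d=0.5, lr_g=0.1):
    k = len(p_data)
    d_logits = [0.0] * k   # D(x) = sigmoid(d_logits[x])
    g_logits = [0.0] * k   # p_g = softmax(g_logits)
    for _ in range(steps):
        # Discriminator step: gradient ascent on V with G held fixed.
        # dV/d(logit_x) = p_data(x) * (1 - D(x)) - p_g(x) * D(x)
        p_g = softmax(g_logits)
        for x in range(k):
            d = sigmoid(d_logits[x])
            d_logits[x] += lr_d * (p_data[x] * (1.0 - d) - p_g[x] * d)
        # Generator step: gradient descent on E_g[log(1 - D)] with D held fixed.
        # log(1 - sigmoid(a)) is computed stably as -log(1 + exp(a)).
        f = [-math.log1p(math.exp(a)) for a in d_logits]
        mean_f = sum(q * fx for q, fx in zip(p_g, f))
        for j in range(k):
            g_logits[j] -= lr_g * p_g[j] * (f[j] - mean_f)
    return softmax(g_logits), [sigmoid(a) for a in d_logits]

p_g, d = train([0.5, 0.3, 0.2])
```

Whether and how quickly `p_g` approaches the target depends on the step sizes; the sketch is meant to show the alternating structure, not to guarantee convergence.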

The equilibrium is $p_g = p_{\text{data}}$, $D(x) = \frac{1}{2}$ everywhere. In practice, reaching it is hard.

One obstacle is that $G$ gets a weak learning signal early in training. A practical fix from the original paper: instead of having $G$ minimize $\log(1 - D(G(z)))$, have it maximize $\log(D(G(z)))$. This gives stronger gradients early on when $D(G(z)) \approx 0$.

The Non-Saturating Loss and Gradient Behavior

The gradient issue is worth unpacking. Assume, as is typical, that the Discriminator ends in a sigmoid, so $D = \sigma(s)$ for a logit $s$. Under the original minimax objective, the Generator's loss $\log(1 - D(G(z)))$ has derivative $-D(G(z))$ with respect to $s$. When $D$ is confident that a generated sample is fake ($D(G(z)) \approx 0$), that derivative vanishes, and with it the gradient that reaches $G$'s parameters. The Generator receives almost no learning signal precisely when it needs the most guidance: early in training, when its outputs look nothing like real data.

The non-saturating alternative $-\log(D(G(z)))$ fixes this. Its derivative with respect to $D(G(z))$ is $-1 / D(G(z))$, which is large when $D(G(z))$ is small; with respect to the logit it is $-(1 - D(G(z)))$, which approaches $-1$ instead of $0$. The Generator gets its strongest kick exactly when the Discriminator is most confident about rejecting fakes. Importantly, the equilibrium point is the same: both objectives drive $G$ toward $p_g = p_{\text{data}}$, but the gradient landscape en route to that equilibrium is far more navigable.
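The contrast is easiest to see at the Discriminator's logit. The sketch below assumes $D = \sigma(s)$ for a pre-sigmoid logit $s$ (a common parameterization, not something this post specifies) and estimates each generator loss's derivative with respect to $s$ by central differences. The function names are illustrative.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def saturating_loss(s):
    """log(1 - D) with D = sigmoid(s): the original minimax generator loss."""
    return math.log(1.0 - sigmoid(s))

def non_saturating_loss(s):
    """-log(D) with D = sigmoid(s): the non-saturating alternative."""
    return -math.log(sigmoid(s))

def central_diff(f, s, eps=1e-5):
    """Numerical derivative of f at s."""
    return (f(s + eps) - f(s - eps)) / (2 * eps)

s = -6.0                 # a confident "fake" verdict: D is roughly 0.0025
d = sigmoid(s)

grad_sat = central_diff(saturating_loss, s)      # analytically -D
grad_ns = central_diff(non_saturating_loss, s)   # analytically -(1 - D)
```

At this logit the saturating loss passes back a derivative close to zero, while the non-saturating loss passes back one close to $-1$, so the signal reaching $G$ is a few hundred times larger.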

This distinction between equivalent optima and different optimization landscapes is a recurring theme in deep learning: the loss you train with is not just about what it converges to, but how it behaves during the journey.

Up Next

In Part 2, we implement these ideas in PyTorch --- a Vanilla GAN with fully-connected layers and a DCGAN with convolutional structure, paying close attention to the architectural details that determine whether training converges or falls apart.