The dominant family of generative models today is denoising diffusion probabilistic models (DDPMs), thanks to their ability to generate high-quality, high-fidelity images and videos. They can be viewed as a generalization of variational auto-encoders (VAEs) that extends one-step generation to multiple steps, mapping a simple probability distribution (usually Gaussian) to the data distribution in a denoising manner. This blog post introduces the basics of VAEs and DDPMs and provides a simple application of generative models.
Preliminary: Variational Auto-Encoders (VAEs)
What do we want from generative models?
The ultimate goal of generative models is to model the real data distribution $p_{\text{data}}(x)$, where $x$ is a data sample from the complete set of real data. However, it's usually infeasible for us to access the full real dataset, so practically we use an observed dataset $\mathcal{D}$ to approximate the real dataset, and thus our goal is to model the observed data distribution $p(x)$, where $x \in \mathcal{D}$. Note that $p_{\text{data}}(x)$ and $p(x)$ are two different probability distributions.
Unfortunately, modeling the observed data distribution directly is still impractical most of the time, especially when the data samples live in a high-dimensional space, such as images, 3D objects, and videos.
Generative models are designed to model the observed data distribution from a given observed dataset $\mathcal{D}$, that is, to estimate a new data distribution $p_\theta(x)$ that approximates $p(x)$. Once trained, we can draw samples from $p_\theta(x)$ to generate new data samples.
Introduction to Auto-Encoders
Before we dive deep into VAEs, we first introduce what auto-encoders (AEs) are. AEs are a class of models that first encode the given input $x$ into a compact latent representation $z$ (called the latent variable), and then decode $z$ back into a reconstruction $\hat{x}$. In this way, AEs are expected to learn the most important features embedded in $x$. In other words, AEs implicitly learn the most significant features of the observed data distribution, and one might hope to generate a new data sample $\hat{x}$ by sampling a latent $z$ in the latent space and decoding it.
However, as you may have already noticed, the problem is: how do we sample latents from the latent space? We know nothing about what the latent space looks like: does it follow a normal distribution, a beta distribution, or a uniform distribution? We don't know, so sampling becomes a problem!
Another problem with AEs is that it's hard to measure the quality of the latent space. Presumably, if two data samples are similar, their encoded latent variables should also be similar, so that they can be decoded back to inputs that are close to each other. But this is simply not guaranteed!
To put the auto-encoding process more formally, we denote the parameterized encoder and decoder respectively as $E_\phi$ and $D_\theta$. The auto-encoding process can be formulated as:

$$z = E_\phi(x), \qquad \hat{x} = D_\theta(z).$$

To train the auto-encoder, an $\ell_2$ loss is generally applied to enforce $\hat{x}$ to be close to the ground-truth value $x$:

$$\mathcal{L}_{\text{recon}} = \|\hat{x} - x\|_2^2 = \|D_\theta(E_\phi(x)) - x\|_2^2.$$
For simple forms of $E_\phi$ and $D_\theta$, we can sometimes obtain a closed-form solution, but for neural networks, gradient descent is usually applied to optimize $\phi$ and $\theta$.
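To make this concrete, here is a minimal auto-encoder sketch in PyTorch; the layer sizes and the flattened-vector input are illustrative choices, not the architecture of the code linked later in this post.

```python
import torch
import torch.nn as nn

# A minimal auto-encoder sketch (illustrative layer sizes).
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder E_phi: input x -> latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder D_theta: latent z -> reconstruction x_hat
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

model = AutoEncoder()
x = torch.randn(16, 784)            # a dummy batch of flattened inputs
x_hat, z = model(x)
loss = ((x_hat - x) ** 2).mean()    # l2 reconstruction loss
loss.backward()                     # gradient descent on phi and theta
```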
There are many variants of AEs to help learn a better latent space,
and some works even quantize the latent space to get a finite set of
latent variables. We will introduce these variants in future posts.
Understanding Variational Auto-Encoders
What is the distribution of the latent space?
As mentioned in the previous section, the biggest problem with AEs is that we cannot sample a latent in the latent space because we have no idea what its distribution is. What if we could constrain, or enforce, the latent distribution to be a simple, well-known distribution, such as a Gaussian? Since we know the distribution, we can draw samples from it and use the decoder to generate data samples.
This leaves us with two questions:
- How can we constrain the latent distribution to be, say, a Gaussian distribution?
- How can we map our input data samples to this distribution, and teach the decoder to generate new data samples from it?
Remember how we teach our AEs to learn decoding from $z$ to $\hat{x}$? One simple idea is: we can add a loss to enforce the latents to approach a known distribution, e.g., a Gaussian distribution. With this loss, we can rewrite our training objective as:

$$\mathcal{L} = \mathcal{L}_{\text{dist}}(z) + \|\hat{x} - x\|_2^2.$$

The second loss, which we call the reconstruction loss, is unchanged, as we want our AEs to learn to reconstruct the input sample. The new loss $\mathcal{L}_{\text{dist}}$ should teach our AEs to encode the input sample into a latent that follows a known distribution. How can we measure the closeness of two distributions? A natural measure is the KL divergence, which is defined as follows:

$$D_{\text{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$
If both $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$ are Gaussians, their KL divergence has the following closed form (recall that the probability density of a random variable drawn from a Gaussian $\mathcal{N}(\mu, \sigma^2)$ is $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$):

$$D_{\text{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{\sigma_2}{\sigma_1} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2}\right].$$

To eliminate the second term, we notice that:

$$\mathbb{E}_{x \sim p}\left[(x-\mu_1)^2\right] = \sigma_1^2, \quad \text{so} \quad \mathbb{E}_{x \sim p}\left[\frac{(x-\mu_1)^2}{2\sigma_1^2}\right] = \frac{1}{2}.$$

To eliminate the third term, we have:

$$\mathbb{E}_{x \sim p}\left[(x-\mu_2)^2\right] = \mathbb{E}_{x \sim p}\left[(x-\mu_1)^2\right] + (\mu_1-\mu_2)^2 = \sigma_1^2 + (\mu_1-\mu_2)^2.$$

Combining all three terms, we have:

$$D_{\text{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.$$

When $q = \mathcal{N}(0, 1)$, the KL divergence can be simplified as:

$$D_{\text{KL}}(p \,\|\, \mathcal{N}(0, 1)) = \frac{1}{2}\left(\sigma_1^2 + \mu_1^2 - 1 - \log \sigma_1^2\right).$$
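As a quick sanity check of this closed form, the snippet below compares it against a Monte Carlo estimate of the KL divergence; the specific values of $\mu_1$ and $\sigma_1$ are arbitrary.

```python
import numpy as np

# Closed-form KL(N(mu1, sigma1^2) || N(0, 1)) vs. a Monte Carlo estimate.
mu1, sigma1 = 0.7, 1.3

closed_form = 0.5 * (sigma1**2 + mu1**2 - 1.0 - np.log(sigma1**2))

# Monte Carlo: average of log p(x) - log q(x) over samples x ~ p.
x = np.random.normal(mu1, sigma1, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma1**2) - (x - mu1) ** 2 / (2 * sigma1**2)
log_q = -0.5 * np.log(2 * np.pi) - x ** 2 / 2
monte_carlo = np.mean(log_p - log_q)

print(closed_form, monte_carlo)  # the two values should agree closely
```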
It's a very nice, simple closed form if we want to minimize the distance between the latent distribution and a standard Gaussian! But wait, we still do not know the distribution of $z$; all we have are the latents $z$ themselves! Wait a minute... instead of producing latents directly from the encoder, we can let the encoder produce the mean $\mu$ and the covariance matrix $\Sigma$. If the elements of $z$ are independent of each other, we can further simplify $\Sigma$ to a diagonal matrix, i.e., $\Sigma = \mathrm{diag}(\sigma^2)$. Once we have $\mu$ and $\sigma$, we can sample any latent $z$ by setting:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
This is the so-called reparameterization trick, a technique that allows gradients to flow through the sampling step by expressing the sample as a deterministic function of $\mu$, $\sigma$, and an independent noise $\epsilon$ drawn from a known distribution. More importantly in our case, it opens up the door for us to work with an explicit Gaussian distribution rather than an unknown distribution implicitly defined by the auto-encoder. This allows us to compute the KL divergence in a very simple way.
Putting it all together, we have our final loss:

$$\mathcal{L} = \underbrace{\|D_\theta(z) - x\|_2^2}_{\text{reconstruction}} + \lambda\, \underbrace{\frac{1}{2}\sum_{i}\left(\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2\right)}_{\text{KL divergence}}, \qquad z = \mu + \sigma \odot \epsilon,$$

where $\lambda$ weights the KL term against the reconstruction term.
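Here is a minimal VAE sketch implementing this loss; the architecture and the KL weight are illustrative and differ from the anime-face code linked below.

```python
import torch
import torch.nn as nn

# A minimal VAE sketch (illustrative architecture).
class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # predicts mu
        self.to_logvar = nn.Linear(256, latent_dim)   # predicts log(sigma^2) for numerical stability
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, kl_weight=1e-3):
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean()
    # Closed-form KL between N(mu, sigma^2) and N(0, I), summed over latent dims
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=-1).mean()
    return recon + kl_weight * kl
```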
Theoretical derivation of VAEs
You may ask whether and why this works: is the training objective really effective at pushing the latent distribution towards a Gaussian, and can we really sample a latent and generate a faithful new data point? To answer this, we need to understand what happens under the hood from a theoretical perspective.
Recall that our ultimate goal is to estimate the real data probability $p(x)$, but this is infeasible in practice. So instead we leverage a model parameterized by $\theta$ to estimate this probability, i.e., $p_\theta(x)$, and our goal becomes maximizing this surrogate likelihood.
In fact, the log probability can be rewritten as:

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]}_{\text{ELBO}} + \underbrace{D_{\text{KL}}\!\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)}_{\geq\, 0},$$

where $q_\phi(z|x)$ can be any distribution parameterized by $\phi$. The first term is the so-called evidence lower bound (ELBO) and the second term is the KL divergence between the approximate posterior and the true posterior. Since the KL term is non-negative, maximizing the ELBO effectively maximizes the log-likelihood of the data.
We can slightly rewrite the first term:

$$\text{ELBO} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right),$$

where $p(z)$ is the prior over the latent variables. Do you notice how this formula relates to our empirical loss? Yes! They are equivalent if we take the negative of the ELBO: one term corresponds to the reconstruction loss and the other to the KL divergence.
The parameter $\phi$ corresponds to the encoder $q_\phi(z|x)$ and $\theta$ to the decoder $p_\theta(x|z)$. What about the prior $p(z)$? It can be anything, but a common choice is a standard Gaussian distribution.
VAEs for Anime Face Generation
To demonstrate how VAEs work, let's use VAEs to generate anime faces,
with data provided in this
repository. You can use this piece of code. Don't forget to install the
required packages.
However, training VAEs is not trivial. As shown in the following figures, the generated images are quite blurry, missing details and lacking diversity.
You can try it yourself with the provided code and tune the loss weight for the KL divergence. This may add some diversity, but the generated images still suffer from heavy blurring.
Understanding Denoising Diffusion Probabilistic Models (DDPMs)
What are DDPMs?
It may seem simple for VAEs to learn to generate new data samples from a known distribution such as a Gaussian, but in fact it is not trivial. Imagine a high-dimensional dataset such as real-world images, characterized by a large number of features; in this case, it's hard to learn a single perfect mapping from a standard Gaussian distribution to the real data distribution, because the latter is very complex and the model struggles to reconstruct it in one shot.
Diffusion models are built on a clever idea: generating a data sample from a single latent is hard, so why not do it in multiple steps? That is, we first sample $x_T$ from a Gaussian distribution, then decode it into a sample whose distribution is close to Gaussian, then decode that latent again, and so on, until the final decoded variable matches the real data distribution. Sampling through a sequence of small steps is much simpler than directly sampling from the Gaussian and decoding it into a real data sample in one step. To put it more simply, our latent is not a single variable anymore, but a sequence of latents $x_T, x_{T-1}, \dots, x_1$, where $T$ is the number of diffusion steps.
Instead of encoding an implicit latent $z$, a sequence of Gaussian noises is progressively added to the original data point $x_0$, so that the more steps are taken, the closer the latent variable gets to a true Gaussian. This is called the forward diffusion process.
Forward diffusion process
Given a data point $x_0$, we can add Gaussian noise to it at each of the $T$ steps, producing a sequence of noisy samples $x_1, \dots, x_T$. We would like the probability distribution of $x_t$ given $x_{t-1}$ to be a Gaussian, which can be explicitly expressed as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right).$$

Here $\{\beta_t\}_{t=1}^{T}$ is a variance schedule controlling how much noise is injected into the previous sample. Note that we use the reparameterization trick to sample $x_t$ from $q(x_t \mid x_{t-1})$ given $x_{t-1}$:

$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0, I).$$

Then we can recursively express $x_t$ in terms of $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \quad \epsilon \sim \mathcal{N}(0, I).$$
From the above equation, we have the full forward diffusion process: starting with a clean input $x_0$ and a specified timestep $t$, we can produce a noised input $x_t$ by injecting a standard Gaussian noise $\epsilon$ weighted by the predefined noise strength $\sqrt{1-\bar{\alpha}_t}$. Intuitively, a larger timestep $t$ results in a noisier data point, so the sequence should satisfy $\bar{\alpha}_1 > \bar{\alpha}_2 > \dots > \bar{\alpha}_T$.
We can also represent $x_0$ in terms of $x_t$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon\right).$$
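The forward process is easy to implement once a variance schedule is fixed; below is a sketch that assumes a linear schedule (schedule choices are discussed later in this post).

```python
import torch

# Forward diffusion sketch: sample x_t directly from x_0 (illustrative linear schedule).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) using the closed form above."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over image dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

x0 = torch.randn(8, 3, 64, 64)                 # a dummy batch of "clean" images
t = torch.randint(0, T, (8,))                  # a random timestep per sample
xt = q_sample(x0, t)
```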
Backward denoising process
What we're really interested in is the reverse process: how to get $x_{t-1}$ from $x_t$. To this end, we must estimate $q(x_{t-1} \mid x_t)$, the backward denoising probability. However, this is intractable, as we know neither $q(x_{t-1})$ nor $q(x_t)$. Fortunately, it is tractable to estimate $q(x_{t-1} \mid x_t, x_0)$ once we also condition on $x_0$:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right), \qquad \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon\right), \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t.$$

Now the mean no longer depends on $x_0$ explicitly; it depends only on $x_t$ and the timestep $t$ (or more specifically the noise $\epsilon$). We can thus rewrite the mean as $\tilde{\mu}_t(x_t, \epsilon)$.
What is the loss function? Recall that in the VAE section we derived the following lower bound:

$$\log p_\theta(x) \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p_\theta(x, z)}{q(z|x)}\right].$$

If we regard $x_{1:T}$ as the latent $z$ and $x_0$ as the data $x$, we will have:

$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right].$$

We can then decompose the lower bound into:

$$\mathbb{E}_q\Big[\underbrace{\log p_\theta(x_0 \mid x_1)}_{\text{reconstruction}} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)}_{\text{denoising matching}} - \underbrace{D_{\text{KL}}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)}_{\text{prior matching}}\Big].$$

Maximizing the ELBO is therefore equivalent to maximizing the reconstruction term and minimizing the KL divergences.
Among these terms, we are most interested in the KL divergence $D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)$, where $q(x_{t-1} \mid x_t, x_0)$ is known given $x_t$ and $x_0$, and $p_\theta(x_{t-1} \mid x_t)$ can be parameterized by a neural network.
Recall that the KL divergence between two Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $d$ dimensions is:

$$D_{\text{KL}} = \frac{1}{2}\left[\log \frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)\right].$$
In our case, $\tilde{\mu}_t$ is known, but the mean of $p_\theta(x_{t-1} \mid x_t)$ is unknown. Similar to $\tilde{\beta}_t I$, we can simplify the covariance to $\sigma_t^2 I$ and parameterize the mean as $\mu_\theta(x_t, t)$. Therefore, the final KL divergence becomes:

$$D_{\text{KL}} = \frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, \epsilon) - \mu_\theta(x_t, t)\right\|_2^2 + \text{const}.$$
The "ground-truth" mean as derived above is .
As is known at
training time, we can ask the model to predict only the noise
for ,
which can be formulated as .
Plug all these into the KL divergence and ignore variances:
Discarding the coefficient, we have the final loss function for timestep $t$:

$$\mathcal{L}_t^{\text{simple}} = \left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\; t\right)\right\|_2^2.$$

That is, during training we randomly sample a timestep $t$ and a Gaussian noise $\epsilon$, obtain $x_t$, and train the model to predict a noise $\epsilon_\theta(x_t, t)$ that matches the ground-truth $\epsilon$.
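A single training step can then be sketched as follows, reusing `q_sample` and the schedule tensors from the forward-diffusion snippet above; `model` stands for any noise predictor, such as a U-Net, that takes `(x_t, t)` as input.

```python
import torch
import torch.nn.functional as F

# One DDPM training step (sketch); q_sample and T come from the earlier snippet.
def training_step(model, x0, optimizer, T=1000):
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # ground-truth noise
    xt = q_sample(x0, t, eps)                                   # noised input x_t
    eps_pred = model(xt, t)                                     # model predicts the injected noise
    loss = F.mse_loss(eps_pred, eps)                            # simple (unweighted) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```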
As for $\sigma_t^2$, it has been found empirically that setting it to $\beta_t$ or simply $\tilde{\beta}_t$ performs well enough. Learning a diagonal $\Sigma_\theta$ results in unstable training.
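With the variance fixed to $\beta_t$, one reverse step can be sketched as below, again reusing the schedule tensors defined earlier; the commented loop shows how a full sample would be drawn.

```python
import torch

# Reverse (denoising) step sketch: sample x_{t-1} from x_t using the predicted noise,
# with the variance fixed to beta_t. Uses betas/alphas/alpha_bars from earlier snippets.
@torch.no_grad()
def p_sample(model, xt, t):
    beta_t, alpha_t, alpha_bar_t = betas[t], alphas[t], alpha_bars[t]
    eps_pred = model(xt, torch.full((xt.shape[0],), t, device=xt.device))
    mean = (xt - beta_t / (1.0 - alpha_bar_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                      # no noise is added at the final step
    noise = torch.randn_like(xt)
    return mean + beta_t.sqrt() * noise  # sigma_t^2 = beta_t

# Full sampling loop: start from pure Gaussian noise x_T and denoise step by step.
# x = torch.randn(8, 3, 64, 64)
# for step in reversed(range(T)):
#     x = p_sample(model, x, step)
```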
Choice of $\beta_t$
The remaining question is how to choose the schedule $\beta_t$, or equivalently $\alpha_t$ or $\bar{\alpha}_t$. Recall that the noise is injected through $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. The simplest schedule is a linear function of $t$, i.e.:

$$\beta_t = \beta_1 + \frac{t-1}{T-1}\left(\beta_T - \beta_1\right),$$

with $\beta_1$ and $\beta_T$ set to small constants (the original DDPM paper uses $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$). An improved schedule is the cosine schedule:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right),$$

where $s$ is a small offset. This schedule yields a nearly linear drop-off of $\bar{\alpha}_t$ in the middle of the process and flatter changes near the extremes of $t = 0$ and $t = T$.
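Both schedules are straightforward to implement; the sketch below follows the linear schedule of the original DDPM paper and the cosine schedule of Nichol & Dhariwal, with illustrative default values.

```python
import math
import torch

# Noise schedule sketches (default values are illustrative).
def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    return torch.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]   # recover beta_t from consecutive alpha_bars
    return betas.clamp(max=0.999).float()
```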
Classifier-Guided Diffusion and Classifier-Free Guidance (CFG)
Classifier guided diffusion
In real-world applications, we usually want to control generation by conditioning on some input, denoted by $y$. Before introducing how to guide diffusion models, let's visit one important formula:

$$\nabla_{x_t} \log q(x_t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}},$$

obtained by incorporating the reparameterization trick $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$. In this way, we connect the gradient of the log-likelihood (the score) with the predicted noise. Intuitively, the direction that increases the likelihood of a data point is the inverse of the noise injected into it.
Now we'd like to estimate the gradient of the joint probability of $(x_t, y)$:

$$\nabla_{x_t}\log q(x_t, y) = \nabla_{x_t}\log q(x_t) + \nabla_{x_t}\log q(y \mid x_t) \approx -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\left(\epsilon_\theta(x_t, t) - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t}\log f_\phi(y \mid x_t)\right).$$

Thus, $\bar{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t}\log f_\phi(y \mid x_t)$ becomes the new noise estimate for timestep $t$. Here $f_\phi(y \mid x_t)$ is an off-the-shelf classifier or a classifier trained from scratch. We can also add a weight $w$ to the classifier part:

$$\bar{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - w\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t}\log f_\phi(y \mid x_t).$$
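A sketch of the guided noise computation is given below; `classifier` is a hypothetical model returning class logits for a noisy input $x_t$, and `alpha_bar_t` stands for the cumulative $\bar{\alpha}_t$ at the current timestep.

```python
import torch

# Classifier-guided noise sketch: shift the predicted noise against the classifier gradient.
def guided_eps(eps_model, classifier, xt, t, y, alpha_bar_t, w=1.0):
    eps = eps_model(xt, t)
    with torch.enable_grad():
        xt_in = xt.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(xt_in, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()   # log f(y | x_t) for the target labels
        grad = torch.autograd.grad(selected, xt_in)[0]        # d log f(y | x_t) / d x_t
    return eps - w * (1.0 - alpha_bar_t) ** 0.5 * grad
```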
Classifier-free guidance
It's cumbersome to rely on an extra classifier to guide the diffusion process. Fortunately, by slightly rewriting the score function, we can bypass the need for an additional classifier model:

$$\bar{\epsilon}_\theta(x_t, t, y) = (1 + w)\, \epsilon_\theta(x_t, t, y) - w\, \epsilon_\theta(x_t, t),$$

where $\epsilon_\theta(x_t, t, y)$ is the predictor with the guidance signal as input and $\epsilon_\theta(x_t, t)$ is the predictor without it. In this formulation, there is no explicit classifier; all we need is a unified diffusion model.
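In code, classifier-free guidance amounts to two forward passes through the same noise predictor; `null_y` below stands for whatever the model uses as the "no condition" input (e.g., a null embedding), which is an assumption of this sketch.

```python
import torch

# Classifier-free guidance sketch: combine conditional and unconditional predictions.
def cfg_eps(eps_model, xt, t, y, null_y, w=3.0):
    eps_cond = eps_model(xt, t, y)         # prediction with guidance as input
    eps_uncond = eps_model(xt, t, null_y)  # prediction without guidance
    return (1.0 + w) * eps_cond - w * eps_uncond
```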
Diffusion Models for Anime Face Generation
Now, let's train diffusion models on the same anime face dataset.
We use a simple U-Net as the model backbone and leverage the diffusers library to
schedule noise. You can download the code here.
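For reference, here is a rough sketch of such a training loop with the diffusers `UNet2DModel` and `DDPMScheduler`; the hyperparameters and the data loading are placeholders and may differ from the downloadable code.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# Rough training-loop sketch with diffusers (placeholder hyperparameters).
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_one_batch(clean_images):          # clean_images: (B, 3, 64, 64), scaled to [-1, 1]
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (clean_images.shape[0],), device=clean_images.device)
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
    noise_pred = model(noisy_images, timesteps).sample   # U-Net predicts the injected noise
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```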
Training diffusion models takes much longer than training VAEs before obtaining satisfactory results. You'll observe the loss converging soon after training begins, but that does not mean the generated images are good enough; as training proceeds, the results keep getting better. Here are some results after training for two epochs on my laptop. If you like, you can train the model longer for higher-quality images.
Conclusion
In this post, we revisited the basic concepts and principles of auto-encoders, variational auto-encoders, and diffusion models, and how they evolved over time. Diffusion models are not designed to replace auto-encoders or VAEs; instead, they can be used in conjunction with them, as in latent diffusion models (LDMs), which trade off compression and performance. More specifically, VAEs are good at preserving high-level composition, whereas diffusion models capture more fine-grained details, and LDMs make it possible to combine the best of both worlds. There are also applications of diffusion models in computer games, such as scene generation, 3D model generation, UV generation, etc. There is huge potential for such models to shine in the near future.