State-of-the-Art Image Generative Models

I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. I summarize the overall development and speculate on future trends. Many of the statements and results here carry over to other non-textual modalities, such as audio and video.


The papers featured in this post belong to one of the following paradigms of SotA image generative models:

  • VAE:
    • VDVAE and VQVAE variants offer SotA diversity (NLL or recall). Furthermore, the sampling speed of VAEs without discrete bottleneck (e.g. VDVAE) is as fast as that of GAN. While VDVAE and VQVAE offer suboptimal quality, VQGAN offers SotA quality.
  • GAN:
    • StyleGAN2 offers SotA quality and fast sampling speed. However, its diversity has not been shown to be on par with VAEs.
  • Diffusion models:
    • DDPMv2 offers SotA quality and diversity. However, its sampling (not training) speed is substantially slower than GAN.
“Speed” refers to sampling speed (not training speed). The VAE sampling speed refers to variants without a discrete bottleneck (e.g. VDVAE).
  • Future trends:
    • VDVAE combined with the ideas from GAN (e.g. a discriminator, StyleGAN2 architecture) may offer SotA quality while maintaining its diversity.
    • Diffusion models may further improve and offer substantially better quality-diversity-compute trade-off with reasonably fast sampling.
  • Evaluation:
    • Proper evaluation of image models is crucial. It is necessary to evaluate a model on both quality (e.g. FID, precision, PPL) and diversity (e.g. recall, NLL) to observe its quality-diversity trade-off rather than just one.
      • In particular, NLL and reconstruction error don’t correlate well with the quality of generated images.
      • Personally, I’m not certain if NLL actually captures diversity as well as recall or Classification Accuracy Score.
    • In order to avoid the effect of overfitting, models need to be evaluated on a large enough dataset, such as ImageNet rather than CIFAR-10.
  • Scaling:
    • Datasets have been growing dramatically in volume and diversity, as in NLP (e.g. DALL-E). We’re in need of a massive dataset that is open and multimodal.
    • As demonstrated by OpenAI’s scaling results, one should scale up model size, use early stopping and reduce the number of epochs to obtain the optimal performance for a given amount of compute.
      • Start using model/pipeline-parallelism for billion-scale image models.
      • For moderately large models, several tens of millions of images and a single epoch seem to be sufficient for compute-optimal training.
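The k-NN-based precision/recall evaluation mentioned above can be made concrete. Below is a minimal sketch of such a metric in the spirit of the improved precision/recall papers; it assumes feature vectors have already been extracted (e.g. from a pretrained classifier), and the function names are my own, not from any reference implementation:

```python
import numpy as np

def knn_radii(feats, k=3):
    # For each feature vector, the radius of its k-th-nearest-neighbor ball.
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the self-distance (0)

def coverage(candidates, reference, radii):
    # Fraction of candidates falling inside at least one reference ball.
    d = np.linalg.norm(candidates[:, None] - reference[None, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats, fake_feats, k=3):
    # Precision: do fakes land on the real manifold?
    # Recall: do reals land on the fake manifold (i.e. is diversity covered)?
    precision = coverage(fake_feats, real_feats, knn_radii(real_feats, k))
    recall = coverage(real_feats, fake_feats, knn_radii(fake_feats, k))
    return precision, recall
```

A model can score high precision with low recall (sharp but mode-collapsed) or the reverse, which is exactly why reporting only one side is misleading.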

Disclaimer: This post is not meant to be exhaustive, and I do not discuss some important paradigms such as flow-based models and autoregressive models. The reason why I omitted autoregressive models, in particular, is that the models with SotA NLL tend not to be evaluated with quality metrics (e.g. FID, precision) or have inference that scales poorly with the number of pixels.

Acknowledgement: I would like to thank Alex Nichol and Yang Song for their valuable feedback about their work and Ethan Caballero for his valuable feedback on this article!


Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

VDVAE’s architecture essentially combines the architecture of U-Net and VAE.


  • VDVAE is the first VAE that performs on par with SotA autoregressive image models in terms of NLL.
  • Because it has no discrete bottleneck, its sampling is as fast as a GAN’s, unlike VQVAE.
VDVAE performs nearly as well as Sparse Transformer with a comparable amount of compute.


  • This model should also have been evaluated with metrics other than NLL, such as FID, to give a more accurate picture of its quality-diversity trade-off. The VQGAN paper measured VDVAE’s FID and showed that VDVAE’s quality is not competitive (as with many other likelihood-based models with SotA NLL) and that VQGAN outperforms VDVAE in terms of FID, but it is unclear how their actual quality-diversity trade-offs compare.
  • In the original VQVAE paper, the traditional VAE performs slightly better than VQVAE in terms of NLL in a head-to-head comparison. Hence, it is quite possible that VDVAE would outperform VQVAE variants.
  • I observed that replacing DMOL with an MSE loss reduces the blurriness, at least in the early training phase. The authors may have used DMOL primarily to measure NLL.
  • I have also observed that the higher-resolution layers of VDVAE can be replaced with residual blocks without noticeable change in image quality.
  • Also, I have observed that quality was more sensitive to the width of the higher-resolution layers than to their depth, but I’m not certain whether this holds at larger scale. This may explain the poor generations on FFHQ-1024 in the original paper, given that the width of the higher-resolution layers was set much smaller in that case.
  • The original implementation does not use mixed precision. I have tried training on TPUs as well as with mixed precision on GPUs, both of which resulted in divergence. Hopefully, this will be fixed soon.
  • VDVAE has been successfully applied to the audio domain to achieve substantially better performance with non-autoregressive generation.
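To see why sampling without a discrete bottleneck is a single feed-forward pass, here is a schematic top-down pass. This is not the actual VDVAE architecture; the layer sizes, the tanh prior and the merge rule are placeholders of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_down(n_levels=4, width=16):
    # Hypothetical top-down hierarchy: each level samples a latent from a
    # prior conditioned on the state so far, then merges it into the state.
    # One pass through the levels, no per-pixel loop -- this is why
    # VDVAE-style sampling can match GAN sampling speed.
    state = np.zeros(width)
    weights = [rng.normal(scale=0.1, size=(width, width)) for _ in range(n_levels)]
    for w in weights:
        prior_mean = np.tanh(state @ w)           # prior p(z_l | z_<l)
        z = prior_mean + rng.normal(size=width)   # reparameterized sample
        state = state + z @ w                     # merge latent into state
    return state
```

Contrast this with an autoregressive pixel model, which must run the network once per pixel, or VQVAE, whose discrete codes are typically sampled by an autoregressive prior.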

Relevant links:

Zero-Shot Text-to-Image Generation

tl;dr: Scales up a text-to-image Transformer with a discrete VAE to 12B parameters and 250M text-image pairs to achieve an unprecedented level of generation quality and diversity.


  • The technical contribution of this work primarily concerns improving conventional distributed training and mixed-precision training on GPUs to make them feasible at this scale.
  • The results show that, while a Transformer with a discrete VAE can produce an image faithful to a given caption, even at this sheer scale the generated images still lack fine details due to the use of an L1 loss, unlike what is possible with a GAN; this motivates the use of a discriminator.
Original vs. reconstruction. Even DALL-E’s scale does not suffice to produce a sharp image, which motivates the use of a discriminator or some other complement.

Relevant links:

Taming Transformers for High-Resolution Image Synthesis

tl;dr: Proposes VQGAN, which combines VQVAE (w/ Transformer) and GAN’s discriminator to outperform BigGAN in terms of quality.

VQGAN vs. DALL-E. This shows that VQGAN’s training with a discriminator leads to sharper samples.


  • This paper shows that, in terms of FID on standard datasets, VQGAN > BigGAN > VDVAE. Furthermore, it also shows VQGAN outperforms VQVAE-2 in terms of FID.
  • Nevertheless, this, along with DC-VAE and other similar papers, is a very encouraging result toward fixing the poor sample quality of VAEs.
Reconstruction FID on ImageNet. This shows that VQGAN achieves substantially better FID for a given codebook size, which roughly determines the per-sample compute for models of the same size.
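The discrete bottleneck shared by VQVAE and VQGAN boils down to a nearest-neighbor codebook lookup. A minimal sketch (omitting the straight-through gradient estimator and commitment loss used during training):

```python
import numpy as np

def quantize(z, codebook):
    # Replace each latent vector with its nearest codebook entry.
    # z: (n, d) continuous encoder outputs; codebook: (K, d) learned entries.
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)        # discrete code indices, the "tokens"
    return codebook[idx], idx
```

The resulting index grid is what the Transformer prior (in VQGAN and DALL-E alike) models autoregressively, so a smaller codebook that still reconstructs well directly reduces the prior’s burden.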

Relevant links:

Dual Contradistinctive Generative Autoencoder

tl;dr: Proposes DC-VAE, which combines VAE, a discriminator and contrastive learning to achieve competitive FID for VAE.

Original image (top) vs. reconstructed image (bottom). Due to the use of a discriminator, each image is sharper than the typical VAE samples.


  • DC-VAE updates the generator with a modified (contrastive) version of a perceptual loss based on an adversarially trained discriminator, which makes more sense than the usual GAN loss given that it has access to both the original and the reconstructed images.
  • While its FID on CelebA-256 slightly lags behind that of VQGAN (~15 vs ~10), it is respectable given that the former is much smaller in parameter count.
  • In my opinion, their approach makes the most sense as a way to add a discriminator to a VAE, compared with other similar approaches, due to its use of the adversarially trained perceptual loss and the absence of a pixel-level loss (e.g. MSE).
    • However, I’m not completely certain whether omitting a pixel-level loss is ultimately better.
  • As a related approach, Soft-IntroVAE instead uses the encoder as a discriminator.
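The contrastive objective can be illustrated with an InfoNCE-style loss over discriminator features, where each reconstruction is the positive for its own source image and the rest of the batch are negatives. This is my simplified reading of the idea, not DC-VAE’s exact formulation:

```python
import numpy as np

def info_nce(f_real, f_recon, tau=0.1):
    # f_real, f_recon: (batch, d) discriminator features of originals and
    # reconstructions. Each reconstruction should match its own source
    # image (the diagonal) against all other images in the batch.
    f_real = f_real / np.linalg.norm(f_real, axis=1, keepdims=True)
    f_recon = f_recon / np.linalg.norm(f_recon, axis=1, keepdims=True)
    logits = f_recon @ f_real.T / tau                  # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))         # diagonal = positives
```

Because the loss is computed on learned discriminator features rather than pixels, it penalizes perceptual mismatch instead of per-pixel error, which is consistent with the paper’s choice to drop MSE.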

Relevant links:

Analyzing and Improving the Image Quality of StyleGAN


  • Improves StyleGAN by a series of architectural improvements to achieve SotA quality (FID & precision) as well as improved diversity (recall) on high-resolution images.


  • There is still no clear evidence that StyleGAN2 offers diversity on par with the SotA models of other paradigms, or that this gap can be closed with a purely GAN-based model.
  • This motivates unifying GAN, in particular the StyleGAN2 architecture, with other models that have competitive NLL or recall. For example, NCSN++ adopts some of the components used in StyleGAN2, such as the scaled skip connection. VQGAN uses a discriminator, though it does not borrow any concept from StyleGAN2 per se.

Relevant links:

Improved Denoising Diffusion Probabilistic Models


  • Improves DDPM to achieve competitive NLL and image quality on par with SotA image models with a few simple modifications.
  • By introducing hierarchy à la VQVAE-2, it also performs almost on par with BigGAN in terms of FID on ImageNet 256×256, likely the best among models with competitive NLL.
  • Improves the sampling speed so that, at best, it is about 50 times slower than a GAN.
Sample quality comparison on class conditional ImageNet 256 × 256. BigGAN FIDs are reported for the truncation that results in the best FID.


  • While DDPM has improved considerably in a relatively short amount of time, it is still unclear how to reduce the sampling time to the level of VDVAE and GAN.
  • They observed that DDPM scales well and that, as with NLL, early stopping works for FID, which is non-trivial given FID’s occasional lack of correlation with NLL. Hopefully, this leads to more compute-efficient scaling practices for image models, with larger model sizes and fewer training iterations.
Ablating schedule and objective on ImageNet 64×64. As one can see, NLL and FID do not correlate well in some cases.
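The sampling-speed gap with GANs comes from ancestral sampling: the network runs once per timestep. A toy sketch of the DDPM reverse process, with a placeholder noise predictor standing in for the trained network (the update is the standard DDPM posterior mean for an epsilon-prediction model):

```python
import numpy as np

def ddpm_sample(eps_fn, shape, betas, rng):
    # eps_fn(x, t) plays the role of the trained noise predictor
    # eps_theta(x_t, t). One network evaluation per timestep, so T = 1000
    # steps means 1000 forward passes per image, vs. one for a GAN.
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                  # start from pure noise x_T
    for t in reversed(range(len(betas))):
        eps = eps_fn(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])   # posterior mean
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise    # sample x_{t-1}
    return x
```

DDPMv2’s speedups come from running this loop on a strided subsequence of timesteps, which shortens the loop without retraining.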

Relevant links:

Score-Based Generative Modeling through Stochastic Differential Equations


  • Proposes NCSN++, which almost matches SotA autoregressive models in NLL and StyleGAN2 (SotA) in FID on CIFAR-10.
Solving a reverse-time SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with a continuous-time SDE. This SDE can be reversed to generate an image out of noise if we know the score of the distribution at each intermediate time step.


  • As far as CIFAR-10 is concerned, NCSN++ offers both SotA quality and diversity. As with DDPMv2, it still suffers from slow sampling speed.
  • While the generation quality on higher-resolution images is impressive, no quantitative evaluation was performed on a large dataset such as ImageNet. I hope subsequent work will resolve this and possibly improve the sampling speed, given that their approach is very novel and should have much potential left for improvement.
  • In particular, a head-to-head comparison against DDPMv2 would help show whether the SDE formulation offers an advantage over DDPMv2’s discrete-time counterpart in terms of the performance-compute trade-off.
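The reverse-time sampling described above can be sketched with an Euler-Maruyama integrator for the VE SDE. Here `score_fn` stands in for the learned score network, and the geometric noise schedule constants are illustrative, not the paper’s settings:

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, sigma_min=0.01, sigma_max=10.0,
                       n_steps=100, rng=None):
    # Euler-Maruyama on the reverse-time VE SDE. Stepping backwards from
    # t = 1 to t ~ 0 with positive step size dt:
    #   x <- x + g(t)^2 * score(x, t) * dt + g(t) * sqrt(dt) * z
    # where sigma(t) is a geometric schedule and g(t)^2 = d sigma(t)^2 / dt.
    rng = rng if rng is not None else np.random.default_rng(0)

    def sigma(t):
        return sigma_min * (sigma_max / sigma_min) ** t

    t_grid = np.linspace(1.0, 1e-3, n_steps)
    dt = t_grid[0] - t_grid[1]                   # positive step size
    x = rng.normal(scale=sigma(1.0), size=shape)  # sample from the prior
    for t in t_grid:
        g2 = sigma(t) ** 2 * 2.0 * np.log(sigma_max / sigma_min)  # g(t)^2
        x = x + g2 * score_fn(x, t) * dt                   # drift term
        x = x + np.sqrt(g2 * dt) * rng.normal(size=shape)  # diffusion term
    return x
```

Like DDPM’s ancestral sampler, each step costs one network evaluation, which is why the slow-sampling caveat applies here as well; the SDE view, however, also admits a deterministic probability-flow ODE that can use adaptive solvers.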

Relevant links:


