  • [VAE-GAN] Autoencoding beyond pixels using a learned similarity metric
    Research/Generative Model 2024. 5. 18. 09:54

    https://arxiv.org/pdf/1512.09300


    Abstract

    We present an autoencoder that leverages learned representations to better measure similarities in data space. By combining a variational autoencoder with a generative adversarial network we can use learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective. Thereby, we replace element-wise errors with feature-wise errors to better capture the data distribution while offering invariance towards, e.g., translation. We apply our method to images of faces and show that it outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that the method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.


    1. Introduction 

    Deep architectures have allowed a wide range of discriminative models to scale to large and diverse datasets. However, generative models still have problems with complex data distributions such as images and sound. In this work, we show that currently used similarity metrics impose a hurdle for learning good generative models and that we can improve a generative model by employing a learned similarity measure.

     

    When learning models such as the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), the choice of similarity metric is central, as it provides the main part of the training signal via the reconstruction error objective. For this task, element-wise measures like the squared error are the default. Element-wise metrics are simple but not very suitable for image data, as they do not model the properties of human visual perception. For example, a small image translation might result in a large pixel-wise error, whereas a human would barely notice the change. Therefore, we argue in favor of measuring image similarity using a higher-level and sufficiently invariant representation of the images. Rather than hand-engineering a suitable measure to accommodate the problems of element-wise metrics, we want to learn a function for the task. The question is how to learn such a similarity measure. We find that by jointly training a VAE and a generative adversarial network (GAN) (Goodfellow et al., 2014), we can use the GAN discriminator to measure sample similarity. We achieve this by combining a VAE with a GAN as shown in Fig. 1: we collapse the VAE decoder and the GAN generator into one by letting them share parameters and training them jointly, and for the VAE training objective we replace the typical element-wise reconstruction metric with a feature-wise metric expressed in the discriminator.
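    To make the translation argument concrete, the following sketch (plain Python/NumPy, not from the paper) shows how a one-pixel shift of a fine checkerboard attains the largest possible pixel-wise error for images in [0, 1], even though a human would find the two images nearly indistinguishable:

    ```python
    import numpy as np

    # A fine checkerboard: to a human, a one-pixel shift is almost invisible.
    y, x = np.mgrid[0:64, 0:64]
    img = ((x + y) % 2).astype(float)        # alternating 0/1 pixels
    shifted = np.roll(img, shift=1, axis=1)  # translate one pixel to the right

    # Element-wise squared error between the image and its translation.
    mse = np.mean((img - shifted) ** 2)
    print(mse)  # 1.0 -- the maximum possible for [0, 1] images, despite the
                # two images being perceptually almost identical.
    ```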

    1.1. Contributions

    Our contributions are as follows:

    • We combine VAEs and GANs into an unsupervised generative model that simultaneously learns to encode, generate and compare dataset samples.
    • We show that generative models trained with learned similarity measures produce better image samples than models trained with element-wise error measures.

    2. Autoencoding with learned similarity

    In this section we provide background on VAEs and GANs. Then, we introduce our method for combining the two, which we refer to as VAE/GAN. As we will describe, the proposed hybrid is motivated as a way to improve the VAE so that it relies on a more meaningful, feature-wise metric for measuring reconstruction quality during training.


    2.1. Variational autoencoder

    A VAE consists of two networks that encode a data sample x to a latent representation z and decode the latent representation back to data space, respectively:

        z ∼ Enc(x) = q(z|x),    x̃ ∼ Dec(z) = p(x|z)    (1)

    The VAE regularizes the encoder by imposing a prior over the latent distribution p(z). Typically z ∼ N(0, I) is chosen. The VAE loss is minus the sum of the expected log likelihood (the reconstruction error) and a prior regularization term:

        L_VAE = L_llike^pixel + L_prior    (2)

    with

        L_llike^pixel = −E_{q(z|x)}[log p(x|z)]    (3)

        L_prior = D_KL(q(z|x) ‖ p(z))    (4)

    where D_KL is the Kullback-Leibler divergence.
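    As a concrete reading of Eqs. 2-4, here is a minimal PyTorch sketch assuming a Gaussian encoder q(z|x) = N(μ, σ²I) and a unit-variance Gaussian observation model, under which −log p(x|z) reduces to an element-wise squared error up to constants; the enc/dec signatures are illustrative, not the paper's code:

    ```python
    import torch
    import torch.nn.functional as F

    def vae_loss(enc, dec, x):
        """L_VAE = L_llike^pixel + L_prior (Eqs. 2-4) for a Gaussian encoder.

        Assumes enc(x) returns (mu, logvar) of q(z|x) and dec(z) returns a
        reconstruction x_tilde; both signatures are illustrative.
        """
        mu, logvar = enc(x)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_tilde = dec(z)

        # L_llike^pixel: -E_q[log p(x|z)]; for a unit-variance Gaussian
        # observation model this is the element-wise squared error (+ const).
        l_llike = F.mse_loss(x_tilde, x, reduction="sum")

        # L_prior: KL(q(z|x) || N(0, I)), available in closed form.
        l_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        return l_llike + l_prior
    ```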


    2.2. Generative adversarial network

    A GAN consists of two networks: the generator network Gen(z) maps latents z to data space, while the discriminator network assigns probability y = Dis(x) ∈ [0, 1] that x is an actual training sample and probability 1 − y that x is generated by our model through x = Gen(z) with z ∼ p(z). The GAN objective is to find the binary classifier that gives the best possible discrimination between true and generated data while simultaneously encouraging Gen to fit the true data distribution. We thus aim to maximize/minimize the binary cross entropy:

        L_GAN = log(Dis(x)) + log(1 − Dis(Gen(z)))    (5)

    with respect to Dis / Gen with x being a training sample and z ∼ p(z).
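    The minimax of Eq. 5 is usually implemented as two separate loss terms, one per network, roughly as in the sketch below; dis is assumed to output probabilities in (0, 1) (e.g. via a final sigmoid), and all names are illustrative:

    ```python
    import torch

    def gan_losses(dis, gen, x, z):
        """Two sides of L_GAN (Eq. 5): Dis ascends it, Gen descends it."""
        eps = 1e-8  # numerical safety inside the logs
        x_fake = gen(z)

        # Discriminator: maximize L_GAN, i.e. minimize its negation; detach
        # the fake so no gradient flows into the generator here.
        l_dis = -(torch.log(dis(x) + eps)
                  + torch.log(1 - dis(x_fake.detach()) + eps)).mean()

        # Generator: minimize log(1 - Dis(Gen(z))).
        l_gen = torch.log(1 - dis(x_fake) + eps).mean()
        return l_dis, l_gen
    ```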


    2.3. Beyond element-wise reconstruction error with VAE/GAN

    An appealing property of GANs is that the discriminator network implicitly has to learn a rich similarity metric for images in order to discriminate them from "non-images". We thus propose to exploit this observation and transfer the properties of images learned by the discriminator into a more abstract reconstruction error for the VAE. The end result is a method that combines the advantage of GANs as high-quality generative models with that of VAEs as a method that produces an encoder of the data into the latent space z.

     

    Specifically, since element-wise reconstruction errors are not adequate for images and other signals with invariances, we propose replacing the VAE reconstruction (expected log likelihood) error term from Eq. 3 with a reconstruction error expressed in the GAN discriminator. To achieve this, let Dis_l(x) denote the hidden representation of the l-th layer of the discriminator. We introduce a Gaussian observation model for Dis_l(x) with mean Dis_l(x̃) and identity covariance:

        p(Dis_l(x) | z) = N(Dis_l(x) | Dis_l(x̃), I)    (6)

    where x̃ ∼ Dec(z) is the sample from the decoder of x. We can now replace the VAE error of Eq. 3 with

        L_llike^Dis_l = −E_{q(z|x)}[log p(Dis_l(x) | z)]    (7)
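    Because the observation model of Eq. 6 has identity covariance, −log p(Dis_l(x)|z) is, up to additive constants, just a squared error in discriminator feature space. A minimal sketch (dis_l is an assumed helper returning the l-th layer activations of the discriminator):

    ```python
    import torch.nn.functional as F

    def feature_reconstruction_loss(dis_l, x, x_tilde):
        """L_llike^Dis_l (Eq. 7) as a squared error between discriminator
        features; dis_l(x) is assumed to return the hidden activations of
        the l-th discriminator layer."""
        return F.mse_loss(dis_l(x_tilde), dis_l(x), reduction="sum")
    ```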

    We train our combined model with the triple criterion

        L = L_prior + L_llike^Dis_l + L_GAN    (8)

    Notably, we optimize the VAE w.r.t. L_GAN, which we regard as a style error, in addition to the reconstruction error, which can be interpreted as a content error using the terminology from Gatys et al. (2015). Moreover, since both Dec and Gen map from z to x, we share the parameters between the two (in other words, we use Dec instead of Gen in Eq. 5).

     

    We refer to Fig. 2 and Alg. 1 for overviews of the training procedure.
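    Fig. 2 and Alg. 1 are not reproduced in this post, so the following PyTorch sketch is a best-effort reading of the training procedure: each parameter set receives gradients only from "its" terms of Eq. 8. The γ weighting of the reconstruction term and the extra decoded prior samples fed to the discriminator come from Alg. 1 in the full paper; everything else (module names, signatures, optimizers) is illustrative:

    ```python
    import torch
    import torch.nn.functional as F

    def vaegan_step(enc, dec, dis, dis_l, x,
                    opt_enc, opt_dec, opt_dis, gamma=1.0):
        """One VAE/GAN training step in the spirit of Alg. 1.

        Assumed signatures: enc(x) -> (mu, logvar); dec(z) -> x_tilde;
        dis(x) -> probability in (0, 1); dis_l(x) -> l-th layer features.
        """
        eps = 1e-8
        mu, logvar = enc(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_tilde = dec(z)                 # reconstruction of x
        x_p = dec(torch.randn_like(z))   # decoded sample from the prior p(z)

        l_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        l_llike = F.mse_loss(dis_l(x_tilde), dis_l(x), reduction="sum")

        # Discriminator side of L_GAN (maximized, hence the negation).
        l_dis = -(torch.log(dis(x) + eps)
                  + torch.log(1 - dis(x_tilde) + eps)
                  + torch.log(1 - dis(x_p) + eps)).mean()
        # Generator side (Eq. 5 is minimized w.r.t. Dec).
        l_gen = (torch.log(1 - dis(x_tilde) + eps)
                 + torch.log(1 - dis(x_p) + eps)).mean()

        opt_enc.zero_grad(); opt_dec.zero_grad(); opt_dis.zero_grad()
        # Route each loss only into its own parameter group (Eq. 8 split).
        (l_prior + l_llike).backward(retain_graph=True,
                                     inputs=list(enc.parameters()))
        (gamma * l_llike + l_gen).backward(retain_graph=True,
                                           inputs=list(dec.parameters()))
        l_dis.backward(inputs=list(dis.parameters()))
        opt_enc.step(); opt_dec.step(); opt_dis.step()
    ```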


    4. Experiments

     


    5. Discussion

    The problems with element-wise distance metrics are well known in the literature and many attempts have been made at going beyond pixels – typically using hand-engineered measures. Much in the spirit of deep learning, we argue that the similarity measure is yet another component which can be replaced by a learned model capable of capturing high-level structure relevant to the data distribution. In this work, our main contribution is an unsupervised scheme for learning and applying such a distance measure. With the learned distance measure we are able to train an image encoder-decoder network generating images of unprecedented visual fidelity as shown by our experiments. Moreover, we show that our network is able to disentangle factors of variation in the input data distribution and discover visual attributes in the high-level representation of the latent space. In principle, this lets us employ a large set of unlabeled images for training and use a small set of labeled images to discover features in latent space.
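    The attribute arithmetic mentioned above is typically done by encoding a few labeled images, averaging their latent codes per attribute, and shifting an arbitrary code by the difference of the means; a hedged sketch (not code from the paper, with the enc/dec signatures assumed earlier):

    ```python
    import torch

    def attribute_vector(enc, x_with, x_without):
        """Hypothetical helper: a visual-attribute direction in latent space,
        computed from small labeled batches with/without the attribute
        (e.g. wearing glasses); enc(x) -> (mu, logvar) is assumed."""
        mu_with, _ = enc(x_with)
        mu_without, _ = enc(x_without)
        return mu_with.mean(dim=0) - mu_without.mean(dim=0)

    # Usage sketch: shift a latent code by the attribute and decode.
    # z_new = z + attribute_vector(enc, x_glasses, x_no_glasses)
    # x_new = dec(z_new)
    ```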

     

    We regard our method as an extension of the VAE framework, though it must be noted that the high quality of our generated images is due to the combined training of Dec as both a VAE decoder and a GAN generator. This makes our method more of a hybrid between VAE and GAN; alternatively, one could view our method as an extension of GAN where p(z) is constrained by an additional network.

     

    It is not obvious that the discriminator network of a GAN provides a useful similarity measure, as it is trained for a different task, namely telling generated samples from real samples. However, convolutional features are often surprisingly good for transfer learning, and as we show, good enough in our case to improve on element-wise distances for images. It would be interesting to see if better features in the distance measure would improve the model, e.g. by employing a similarity measure provided by a Siamese network trained on faces; in practice, though, Siamese networks are not a good fit with our method as they require labeled data. Alternatively, one could investigate the effect of using a pretrained feedforward network for measuring similarity.

     

    In summary, we have demonstrated a first attempt at unsupervised learning of encoder-decoder models as well as a similarity measure. Our results show that the visual fidelity of our method is competitive with that of GANs, which in that regard are considered state of the art. We therefore consider learned similarity measures a promising step towards scaling up generative models to more complex data distributions.
