In these lectures, we discuss autoregressive generative models such as NADE, MADE, PixelCNN, PixelRNN, and PixelVAE.

**Slides:**

- Autoregressive Generative Models (slides from Hugo Larochelle, Vincent Dumoulin and Aaron Courville)

**Reference:** (* = you are responsible for this material)

- *Sections 20.10.5-20.10.10 of the Deep Learning textbook.
- The Neural Autoregressive Distribution Estimator by Hugo Larochelle and Iain Murray (AISTATS2011)
- MADE: Masked Autoencoder for Distribution Estimation by Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle (ICML2015).
- Pixel Recurrent Neural Networks by Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu (ICML2016)
- *Conditional Image Generation with PixelCNN Decoders by Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu (NIPS2016)
- Parallel Multiscale Autoregressive Density Estimation by Scott Reed, Aaron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Dan Belov and Nando de Freitas (arXiv:1703.03664, 2017)

Practical question: how do you actually plug in the captions?

If you have a GAN whose generator creates an image from a noise vector, the caption can be fed in as an additional input alongside the noise. The same goes for the discriminator: it sees both the image and the caption and evaluates accordingly.
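
To make the idea concrete, here is a minimal NumPy sketch of conditioning by concatenation. Everything in it is an assumption for illustration: the sizes, the single linear layer standing in for each network, and the random `caption` array standing in for a real caption vector. It only shows where the caption enters, not how to train a GAN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, noise_dim, caption_dim = 4, 100, 128
image_dim = 64 * 64  # a flattened 64x64 image

noise = rng.standard_normal((batch_size, noise_dim))
# Stand-in for a real caption vector (random numbers here).
caption = rng.standard_normal((batch_size, caption_dim))

# Generator: the caption is concatenated with the noise vector.
generator_input = np.concatenate([noise, caption], axis=1)
W_g = rng.standard_normal((noise_dim + caption_dim, image_dim)) * 0.01
fake_images = np.tanh(generator_input @ W_g)

# Discriminator: the caption is concatenated with the (flattened) image.
discriminator_input = np.concatenate([fake_images, caption], axis=1)
W_d = rng.standard_normal((image_dim + caption_dim, 1)) * 0.01
scores = 1.0 / (1.0 + np.exp(-(discriminator_input @ W_d)))  # sigmoid
```

In a real model each linear layer would be replaced by a deeper network, but the caption would enter in the same two places.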

But how do you go from text to a vector/image? I’ve trained a char-level LSTM RNN, and it generates more text just fine. Instead of outputting text, does it need to output some sort of embedding? What is an embedding anyway? A dictionary of words or word fragments plus a conditional probability in [0,1] based on previous inputs?

The hidden states of the RNN would be considered an embedding. What is commonly done is to take the last hidden state (which is a vector) and consider that to be a summary of what the RNN has seen. Alternatively you can attend to the sequence of hidden states using any of the schemes discussed in the last lecture.
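
As a rough sketch of both options, the NumPy code below runs a vanilla tanh RNN (an assumption, used instead of an LSTM to keep it short) over a one-hot character sequence, takes the last hidden state as the embedding, and also computes a dot-product attention summary. The `query` vector, the sizes, and the random weights are hypothetical, and dot-product attention is just one possible scheme, not necessarily one from the lecture.

```python
import numpy as np

def rnn_hidden_states(inputs, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN and return all hidden states, one per step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, vocab_size, hidden_dim = 12, 30, 128  # hypothetical sizes

# A random sequence of one-hot encoded characters standing in for a caption.
char_ids = rng.integers(0, vocab_size, size=seq_len)
inputs = np.eye(vocab_size)[char_ids]

# Untrained random weights, for illustration only.
W_xh = rng.standard_normal((vocab_size, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

states = rnn_hidden_states(inputs, W_xh, W_hh, b_h)

# Option 1: the last hidden state is the summary of the whole sequence.
embedding = states[-1]

# Option 2: attend over all hidden states with a (hypothetical) query vector.
query = rng.standard_normal(hidden_dim)
attn_scores = states @ query
attn_weights = np.exp(attn_scores - attn_scores.max())
attn_weights /= attn_weights.sum()
attended = attn_weights @ states  # weighted average of the hidden states
```

Either `embedding` or `attended` is a fixed-size vector that can then be passed on to the rest of the model.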

You can map the vector to an “image” by a linear mapping to a high-dimensional space and reshaping the resulting vector to have an image shape. E.g. if your vector has size 128 and you want to end up with a 64×64 image, you can use a 128×4096 weight matrix to go from 128 to 4096 and then reshape to 64×64. If you want multiple feature maps, say you want to map to 64×64×10, then use a 128×40960 matrix.
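
The shapes above can be checked with a short NumPy sketch. The random `caption_embedding` and the randomly initialized weight matrices are placeholders (assumptions); in practice the weights would be learned along with the rest of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

embedding_dim = 128
height, width, channels = 64, 64, 10

# Placeholder for the caption embedding (e.g. the last RNN hidden state).
caption_embedding = rng.standard_normal(embedding_dim)

# Single 64x64 "image": a 128x4096 weight matrix, then a reshape.
W_single = rng.standard_normal((embedding_dim, height * width)) * 0.01
image = (caption_embedding @ W_single).reshape(height, width)

# Ten feature maps: a 128x40960 weight matrix, then a reshape to 64x64x10.
W_multi = rng.standard_normal((embedding_dim, height * width * channels)) * 0.01
feature_maps = (caption_embedding @ W_multi).reshape(height, width, channels)

print(image.shape)         # (64, 64)
print(feature_maps.shape)  # (64, 64, 10)
```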

Pretty clear. Thanks Tim!
