A powerful perspective on Machine Learning is that the most fundamental task in AI is to learn a probabilistic joint distribution which models the lifeworld. For such a model to be optimal, it would need to understand many things. It would need to understand basic laws of physics. It would need to be able to do conditional generation: predicting how day will turn to night, how leaves will fall from trees, how the young will grow old. By predicting past actions conditioned on future rewards, one could even learn to plan and act.
Unfortunately the current state of the art is still far from this ideal. While we now have compelling probabilistic models of small images and text-conditioned sound clips, we are very far from modeling the lifeworld of a human being or even a simple animal.
The project for this course is to generate the middle region of images conditioned on the outside border of the image and a caption describing the image. To be as successful as possible, the model needs to be able to understand the specific meaning of the caption in the context of a specific image.
To our knowledge this is a novel task, so a successful project would in fact be a valuable research contribution. However, we also designed the project so that even a very simple solution can achieve reasonable results.
Questions of interest to us include:
- To what extent is the caption useful for the inpainting task?
- What models produce the “best” results?
- How can we quantitatively evaluate the output in a way that matches human judgements of quality?
We will provide the specific data to be used by all students.
The original images in the MSCOCO dataset are high resolution: roughly 500×500 pixels. While full resolution would give the most interesting results, it would likely be too computationally intensive for a class project (and indeed, for most research projects). Many successful generative modeling papers still evaluate on 32×32 images, and in fact a 32×32 image can contain a lot of detail!
For the actual project, we downsampled all images to 64×64, and the goal is to complete the middle 32×32 section.
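As a concrete illustration of the setup, here is a minimal NumPy sketch of splitting a 64×64 training image into the masked input and the 32×32 center target. The function name and the choice of zero-filling the masked region are ours, not part of any provided starter code:

```python
import numpy as np

def split_inpainting_example(img, border=16, size=32):
    """Split a 64x64 HxWxC image into (masked input, center target).

    The center `size x size` patch is the prediction target; the input
    is the same image with that patch blanked out (here: set to zero).
    Illustrative helper only -- not part of the course starter code.
    """
    target = img[border:border + size, border:border + size].copy()
    masked = img.copy()
    masked[border:border + size, border:border + size] = 0  # blank the center
    return masked, target
```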
Here are actual example images for the project:
Reasonable Model Choices
A successful model is likely to use one of these basic elements, or several of them together, although you should by no means limit yourself to these tools:
- L2 Reconstruction Network (similar to an autoencoder)
- Autoregressive Model (especially PixelCNN)
- Plug and Play Generative Networks (PPGN)
- Part-of-speech taggers for picking keywords in the captions
Some of the best generative models don’t give a tractable quantitative evaluation metric (they just provide samples), so we’re not going to require any quantitative evaluation. Visual inspection of samples is enough.
However, if a student wishes to do quantitative evaluation, that will be seen as a major positive. How to best do quantitative evaluation of image generation systems is currently a major research topic.
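If you do attempt quantitative evaluation, one simple (if imperfect) starting point is pixel-wise PSNR over the generated center region; keep in mind that pixel error correlates only loosely with perceived quality. A sketch, with an illustrative function name:

```python
import numpy as np

def center_psnr(pred, target, max_val=255.0):
    """PSNR between a predicted and a ground-truth center patch.

    Higher is better; identical patches give infinity. A crude metric:
    blurry outputs can score well despite looking poor.
    """
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```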
This is an individual project, but we encourage students to share intermediate results and ideas, as long as they are transparent about what they've used from others!
Intermediate Evaluation Date: TBA
Due Date: TBA
The dataset is a downsampled 64×64 version of the MSCOCO dataset (http://mscoco.org/dataset/#overview)
Some of the images might be grayscale instead of RGB; you can simply skip those images.
The dataset is available here: dataset,
In particular, the archive inpainting.tar.bz2 contains:
– train2014: directory containing 82,782 training images,
– val2014: directory containing 40,504 validation images,
– dict_key_imgID_value_caps_train_and_valid.pkl: a pickled Python dictionary containing the captions associated with the train/valid images,
– worddict.pkl: a pickled dictionary of the words that make up the captions.
To extract the archive:
tar xjvf inpainting.tar.bz2
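Once extracted, the caption dictionary can be loaded with Python's standard pickle module, and grayscale images can be detected by their array shape. A sketch, assuming the archive was extracted into the working directory and images are loaded as NumPy arrays (the helper names are ours):

```python
import pickle
import numpy as np

def load_captions(path="dict_key_imgID_value_caps_train_and_valid.pkl"):
    """Load the pickled caption dictionary shipped with the dataset."""
    with open(path, "rb") as f:
        return pickle.load(f)

def is_rgb(img_array):
    """True for an HxWx3 color image; grayscale images (which the
    project allows you to skip) lack the 3-channel axis."""
    return img_array.ndim == 3 and img_array.shape[-1] == 3
```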
Hades GPU cluster:
Please refer to the Cluster tutorial to see how to use the Hades GPU cluster. You should have received your account information by email.