# Assignment: MLP on MNIST

As a starting point for the class, you should have a good enough understanding of Python and numpy to work through the basic task of classifying MNIST digits with a one-hidden-layer MLP.

You’re not required to hand in anything. However, it is highly recommended that you complete it, as some of the material that will be covered in the next few weeks will build on the assumption that you’re comfortable with this assignment.

### Refresher: MNIST

The MNIST dataset contains labeled images of handwritten digits. Images have size 28 x 28 pixels and are divided into a training set containing 60,000 examples and a test set containing 10,000 examples. Here are some examples:

wget http://deeplearning.net/data/mnist/mnist.pkl.gz


In this version of the dataset, the training set has further been split into a training set containing 50,000 examples and a validation set containing 10,000 examples.

Once unzipped and unpickled, the file contains a tuple with three elements for the training, validation and test set respectively. Each element is itself a tuple containing a $n \times 784$ matrix of examples and a $n$-dimensional vector of labels (where $n$ is the number of examples in this specific set). Features in the example matrix are floats in [0, 1] (0 being black, and 1 being white), and labels are integers ranging between 0 and 9.

You can load the data like so:

import cPickle
import gzip

with gzip.open('mnist.pkl.gz', 'rb') as f:
# Whatever else needs to be done


### Refresher: MLP

A multilayer perceptron (MLP) produces a prediction according to the following equations:

${\mathbf{h} = \sigma (\mathbf{W} \mathbf{x} + \mathbf{b}) = \frac{1}{1 + \exp(-\mathbf{W} \mathbf{x} - \mathbf{b})}}$

$\mathbf{y} = \frac{\exp(\mathbf{V} \mathbf{h} + \mathbf{c})}{\sum_i \exp(\mathbf{V}_i \mathbf{h} + c_i)}$

Here, $\mathbf{y}$ is interpreted as a vector of probabilities, i.e. $y_k$ is the probability that example $\mathbf{x}$ belongs to class $k$ and is normalized such that $\sum_k y_k = 1$.

Given a one-hot encoded target $\mathbf{t}$ (i.e., $\mathbf{t} = [t_1, \ldots, t_C]$ where $C$ is the number of classes, $t_k \in \{0, 1\}$ and $\sum_k t_k = 1$), the loss function is definded as

${\mathcal{L} = \sum_k -t_k \log(y_k)}$

For this assignment, you are asked to implement a one-hidden-layer MLP and train it on MNIST using Python and numpy. You’ll need to do the following:

1. Using the backpropagation principle, find the derivatives of the loss function with respect to parameters $W_{ij}, b_j, V_{jk}, c_k$.
2. Using what you derived in step 1, find an expression for the derivative of the loss function with respect to parameter vectors and matrices $\mathbf{W}, \mathbf{b}, \mathbf{V}, \mathbf{c}$. This is a good thing to do, because numpy has abstractions to represent dot products between vectors, vector-matrix products, elementwise operations, etc. that rely on heavily optimized linear algebra libraries.  Coding up all these operations using for-loops would be both inefficient and tedious.
3. Implement the code necessary to compute gradients using the expressions derived in step 2 for a pair of feature and target vectors $\mathbf{x}$ and $\mathbf{t}$.
4. Using the finite differences method, implement a numerical version of the code that computes the gradients, and compare the analytical gradients and the numerical gradients using a toy example (e.g. a random pair of feature and target vectors for an MLP with 10 input dimensions, 5 hidden units and 2 categories). A mismatch between the two indicates an error in the implementation of the analytical gradients. The numpy.testing.assert_allclose method will be handy for that.
5. Generalize your code to work on minibatches of data. Make sure that numerical and analytical gradients still check out on a toy example.
6. Implement a simple minibatch SGD loop and train your MLP on MNIST. You should be able to achieve under 5% error rate on the test set.

### Optional Extra tasks (for fun)

1. When computing your backward pass, try using a random weight matrix instead of your learned weight matrix.  Does it effect the results?  Does it matter if you use a single fixed random weight matrix or draw a new random weight matrix on each iteration?
2. When computing your backward pass, try using a copy of your weight matrix from k iterations ago.  Does this effect the model’s performance?
3. When computing your backward pass, try using random hidden states instead of what you used in the forward pass.  Does it effect the results?  Does it matter if the random hidden states are fixed or redrawn on each iteration?
4. When computing dL/dw, try using a random h instead of the h from your forward pass.  Does it effect the results?
5. Try adding a little bit of gaussian noise to h in the forward pass.  How does it effect results?  Try adding a little bit of gaussian noise to h when doing the backward pass.  How does it effect results?  Which is better?  Does it work better or worse (or the same) if you use the same sample or resample from the noise distribution in the forward and backward pass?
6. When doing your forward pass, try rounding each value of h to the nearest integer.  Does this effect results?  Does it effect results if you round the h during the backward pass or when computing dL/dw?