# Q19 – Linear Regression

Consider a linear regression problem with input data $\boldsymbol{X} \in \mathbb{R}^{n\times d}$, weights $\boldsymbol{w} \in \mathbb{R}^{d \times 1}$, and targets $\boldsymbol{y} \in \mathbb{R}^{n \times 1}$. Now suppose that dropout is applied to the input units with probability $p$.

1) Rewrite the input data matrix to account for the probability of each unit being dropped out (Hint: whether each unit is dropped out is a Bernoulli random variable with probability $p$).

2) What is the cost function of the linear regression with dropout?

3) Show that applying dropout to the linear regression problem above can be seen as using L2 regularization in the loss function.
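As a numerical sanity check (not a proof) of part 3: with per-entry masks $r_{ij} \sim \text{Bernoulli}(1-p)$ applied to $\boldsymbol{X}$, the expected squared-error loss decomposes into a rescaled squared error plus a per-feature L2 penalty, $\mathbb{E}\,\|\boldsymbol{y} - (\boldsymbol{R}\odot\boldsymbol{X})\boldsymbol{w}\|^2 = \|\boldsymbol{y} - (1-p)\boldsymbol{X}\boldsymbol{w}\|^2 + p(1-p)\sum_j w_j^2 \sum_i x_{ij}^2$. The sketch below compares a Monte Carlo estimate of the left-hand side with the closed form; all data and constants are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 0.3  # illustrative sizes and dropout probability
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = X @ w + 0.1 * rng.normal(size=n)

# Monte Carlo estimate of E_R ||y - (R * X) w||^2 with R_ij ~ Bernoulli(1 - p)
trials = 20000
mc = 0.0
for _ in range(trials):
    R = rng.random(size=(n, d)) > p  # each entry kept with probability 1 - p
    mc += np.sum((y - (R * X) @ w) ** 2)
mc /= trials

# Closed form: ||y - (1-p) X w||^2 + p(1-p) * sum_j w_j^2 * sum_i x_ij^2
closed = np.sum((y - (1 - p) * X @ w) ** 2) \
         + p * (1 - p) * np.sum((w ** 2) * np.sum(X ** 2, axis=0))

print(mc, closed)  # the two agree up to Monte Carlo error
```

The L2 penalty here is feature-weighted by $\sum_i x_{ij}^2$, which is the precise sense in which input dropout acts as L2 regularization.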

# Q18 – Regularization

Show that L2 regularization applied to a linear regression with weights $\boldsymbol{w}$, input data $\boldsymbol{x}$ and targets $\boldsymbol{y}$ with mean squared error loss function corresponds to assuming a Gaussian prior over the weights.
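A numerical check of the correspondence (the derivation itself is the exercise): if the prior is $\boldsymbol{w} \sim \mathcal{N}(0, \tau^2 I)$ and the noise is Gaussian with variance $\sigma^2$, the MAP estimate is the ridge solution with $\lambda = \sigma^2/\tau^2$, so the gradient of the negative log posterior should vanish at the ridge solution. Synthetic data throughout.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
sigma2, tau2 = 1.0, 2.0   # noise variance and prior variance (illustrative)
lam = sigma2 / tau2       # the equivalent L2 coefficient
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# Ridge / L2-regularized least squares: argmin ||y - Xw||^2 + lam ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of the negative log posterior -log p(y|X,w) - log p(w):
# (1/sigma2) X^T (Xw - y) + (1/tau2) w, which is zero at the MAP estimate
grad = (X.T @ (X @ w_ridge - y)) / sigma2 + w_ridge / tau2
print(np.max(np.abs(grad)))  # ~0: the ridge solution is the MAP estimate
```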

# Q17 – ConvNet Invariances

Question 1: A convolutional neural network (CNN) can be "insensitive" to slight spatial variations in the input data, such as translation. Compared with regular feed-forward neural networks, the CNN architecture has two components responsible for this kind of insensitivity. Identify those components and explain how a CNN can ignore small translations in the input data.
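As a hint, one of the two components can be illustrated in a few lines: a 1-D "feature map" shifted by one position still produces the same max-pooled output, as long as the shift stays within a pooling window (the array values are arbitrary).

```python
import numpy as np

# A 1-D feature map and its 1-pixel translation
a = np.array([0., 5., 1., 0., 0., 0., 0., 0.])
b = np.roll(a, 1)

def max_pool(x, size=4):
    # Non-overlapping max pooling with the given window size
    return x.reshape(-1, size).max(axis=1)

print(max_pool(a), max_pool(b))  # identical despite the shift
```

Note that the invariance is only local: a shift that moves the feature across a window boundary does change the pooled output.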

# Q16 – Linear RNN Dynamics

Consider the behavior of a linear RNN:
$h_t = W h_{t-1} + U x_{t} + b$

1.  Write $h_t$ as a function of $h_0$.
2.  Write out $\frac{d h_t}{d h_0}$.
3.  What happens when $t \to \infty$? Under what conditions?
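The closed form asked for in part 1 can be checked numerically: unrolling the recurrence gives $h_T = W^T h_0 + \sum_{k=1}^{T} W^{T-k}(U x_k + b)$, and the Jacobian in part 2 is $\frac{d h_t}{d h_0} = W^t$, whose growth or decay is governed by the spectral radius of $W$. A sketch with arbitrary synthetic values:

```python
import numpy as np

rng = np.random.default_rng(2)
dh, dx, T = 4, 3, 10
W = 0.5 * rng.normal(size=(dh, dh))  # scale chosen arbitrarily
U = rng.normal(size=(dh, dx))
b = rng.normal(size=dh)
xs = rng.normal(size=(T, dx))
h0 = rng.normal(size=dh)

# Roll the recurrence h_t = W h_{t-1} + U x_t + b forward
h = h0.copy()
for t in range(T):
    h = W @ h + U @ xs[t] + b

# Closed form: h_T = W^T h_0 + sum_{k=1}^T W^{T-k} (U x_k + b)
Wp = np.linalg.matrix_power
h_closed = Wp(W, T) @ h0 + sum(Wp(W, T - k) @ (U @ xs[k - 1] + b)
                               for k in range(1, T + 1))
print(np.allclose(h, h_closed))

# dh_T/dh_0 = W^T, so its norm is governed by the spectral radius of W:
# radius < 1 means the gradient vanishes, > 1 means it explodes as T grows.
print(max(abs(np.linalg.eigvals(W))))
```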

# Q15 – Softmax and Cross Entropy

The softmax function for $m$ classes is given by

$p_i = \frac{e^{x_i}}{\sum_{j=1}^m e^{x_j}} \text{ for } i = 1\ldots m$.

It transforms a vector $(x_i)$ of real values into a probability mass vector for a categorical distribution.  It is often used in conjunction with the cross-entropy loss
$L(x, y) = - \sum_{i=1}^m y_i \log p_i$

1. Find a simplified expression for $p_i$ when $m = 2$.
2. Differentiate $p_i$ with respect to $x_k$.
3. Differentiate $L$ with respect to $x_k$.
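A quick numerical check of the gradient asked for in part 3: the standard result is $\frac{\partial L}{\partial x_k} = p_k - y_k$ when $\sum_i y_i = 1$, which can be compared against central finite differences (synthetic values throughout).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

rng = np.random.default_rng(3)
m = 5
x = rng.normal(size=m)
y = np.eye(m)[2]            # one-hot target
analytic = softmax(x) - y   # claimed gradient: dL/dx_k = p_k - y_k

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(x + eps * np.eye(m)[k], y) - loss(x - eps * np.eye(m)[k], y)) / (2 * eps)
    for k in range(m)
])
print(np.max(np.abs(analytic - numeric)))  # small
```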

# Q13 – Activation Functions II

Contributed by Pulkit Khandelwal.

Consider the neural network shown in the figure below. The network has linear activation functions. Let the weights be defined as shown in the figure, and suppose the output of each unit is also multiplied by some constant $k$.

1. Re-design the neural network to compute the same function without using any hidden units. Express the new weights in terms of the old weights. Draw the obtained perceptron.
2. Can the space of functions that is represented by the above artificial neural network also be represented by linear regression?
3. Is it always possible to express a neural network made up of only linear units as an equivalent network without a hidden layer? Give a brief justification.
4. Let the hidden units use sigmoid activation functions and let the output unit use a threshold activation function. Find weights which cause this network to compute the XOR of $X_{1}$ and $X_{2}$ for binary-valued $X_{1}$ and $X_{2}$. Assume that there are no bias terms.

# Q12 – Function Representation and Network Capacity

Contributed by Pulkit Khandelwal.

Suppose we are given two types of activation functions, linear and hard threshold, defined below:

• Linear:  $y = w_{0} + \sum_{i}w_{i}x_{i}$
• Hard Threshold:  $y=\left\{ \begin{array}{@{}ll@{}} 1, & \text{if}\ w_{0} + \sum_{i}w_{i}x_{i} \geq 0 \\ 0, & \text{otherwise} \end{array}\right.$

Which of the following can be exactly represented by a neural network with one hidden layer? You can use linear and/or threshold activation functions. Justify your answer with a brief explanation.

1. polynomials of degree 2
2. polynomials of degree 1
3. hinge loss
4. piecewise constant functions
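As a hint toward part 4: a single hidden layer of hard-threshold units with a linear output can realize any piecewise-constant function with finitely many pieces, by placing one step unit at each breakpoint and letting the output weights add up the jump heights. A small sketch with a made-up target function:

```python
import numpy as np

def hard_threshold(z):
    return (z >= 0).astype(float)

# Made-up target: f(x) = 0 for x < 1, 2 for 1 <= x < 3, 5 for x >= 3.
# One threshold unit per breakpoint; the linear output sums the jumps:
# f(x) = 2 * step(x - 1) + 3 * step(x - 3)
def net(x):
    h = hard_threshold(np.array([x - 1.0, x - 3.0]))
    return np.array([2.0, 3.0]) @ h

print([float(net(x)) for x in (0.0, 2.0, 4.0)])  # → [0.0, 2.0, 5.0]
```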