# Q8 – Convolution and Fourier Transform

Contributed by Louis-Guillaume Gagnon

1.  Consider a convolution operation over continuous functions: $s(t) = \int_{-\infty}^\infty x(a)w(t - a)da$. Using the definition of the Fourier transform $F(k) = \int_{-\infty}^\infty f(x)e^{-2i\pi k x}dx$ and its inverse $f(x) = \int_{-\infty}^\infty F(k)e^{2i\pi k x}dk$, show that taking the inverse transform of the product of two Fourier-transformed functions yields the convolution of these functions in their original domain.
2. For discrete functions over a finite domain, what is the time complexity of the regular, time-domain-only convolution operation? What is the time complexity of a convolution implemented using the Fourier transform?
3. Using the programming language of your choice (e.g. Python with NumPy), empirically verify your results from part 2.
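A possible starting point for part 3, assuming Python with NumPy (the helper name `fft_conv` and the chosen sizes are illustrative; `np.convolve` serves as the time-domain baseline):

```python
import time
import numpy as np

def fft_conv(x, w):
    # O(n log n): transform, multiply spectra, invert; zero-pad to full length
    size = len(x) + len(w) - 1
    return np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(w, size), size)

rng = np.random.default_rng(0)
x, w = rng.standard_normal(4096), rng.standard_normal(4096)

# Sanity check: both routes compute the same (full) convolution
assert np.allclose(np.convolve(x, w), fft_conv(x, w))

# Crude timings at growing sizes: the direct method should scale roughly as
# n^2 and the FFT route as n log n
for n in (1024, 4096, 16384):
    a, b = rng.standard_normal(n), rng.standard_normal(n)
    t0 = time.perf_counter(); np.convolve(a, b); t1 = time.perf_counter()
    fft_conv(a, b); t2 = time.perf_counter()
    print(f"n={n:6d}  direct={t1 - t0:.4f}s  fft={t2 - t1:.4f}s")
```

For a cleaner scaling plot, average several repetitions per size; single timings are noisy.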

# Q7 – Jacobian of Softmax and Sigmoid

Contributed by Louis-Guillaume Gagnon

1.  Compute the Jacobian matrix of the softmax function, $S(x_i) = \frac{e^{x_i}}{\sum_k e^{x_k}}$. Express it as a matrix equation.
2. Compute the Jacobian matrix of the sigmoid function, $\sigma(x) = 1/(1 + e^{-x})$.
3.  Let $y$ and $x$ be vectors related by $y = f(x)$. Let $L$ be an unspecified loss function, and let $g_x$ and $g_y$ be the gradients of $L$ with respect to $x$ and $y$. Let $J_y(x)$ be the Jacobian of $y$ with respect to $x$. Eq. 6.46 (of the Deep Learning book) tells us that $g_x = J_y(x) \cdot g_y$. Show that if $f(x) \equiv \sigma(x)$, the above can be rewritten as $g_x = g_y \odot \sigma'(x)$. If $f(x) \equiv S(x)$, can $g_x$ be written the same way?
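A finite-difference check is a useful way to validate the derivations above. The NumPy sketch below encodes the closed form $J = \mathrm{diag}(s) - ss^T$, so treat it as a partial spoiler for part 1; the test values are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite differences, one column per input coordinate
    J = np.zeros((len(f(x)), len(x)))
    for j in range(len(x)):
        d = np.zeros(len(x)); d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([0.5, -1.2, 2.0])

# Candidate closed form for part 1: J = diag(s) - s s^T
s = softmax(x)
assert np.allclose(np.diag(s) - np.outer(s, s),
                   numerical_jacobian(softmax, x), atol=1e-6)

# Part 3: the sigmoid Jacobian is diagonal, so J_y(x) g_y collapses to the
# elementwise product g_y * sigma'(x)
g_y = np.array([1.0, -2.0, 0.5])
J_sig = numerical_jacobian(sigmoid, x)
assert np.allclose(J_sig @ g_y, g_y * sigmoid(x) * (1 - sigmoid(x)), atol=1e-6)
```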

# Q6 – Mixture Density Network

Contributed by Chin-Wei Huang.

Consider a univariate-output Mixture Density Network with $n$ mixture components: $p(y|x) = \sum_{i=1}^n p(c=i|x)p(y|x;\theta_i)$, where $\theta=\{\theta_i\}_{i=1:n}$ is the set of class-conditional parameters.

In the following questions, assume the “prior” probability distributions $p(c=i|x)$ form a multinoulli distribution parameterized by a softmax function (a one-hidden-layer network) mapping from the input, i.e. $p(c=i|x;\zeta) = s_i(x;\zeta) = \frac{\exp{(\zeta_i^T{x})} }{{\sum_{i'}{\exp{(\zeta_{i'}^T{x})}}}}$.

1. Suppose $Y$ is continuous, and let $p(y|x;\theta_i) = \mathcal{N}(y;x^T\beta_i,\sigma^2_i)$. To do prediction, use the conditional expectation as a point estimate of the output. Derive $\mathbf{E}[y|x]$ and $\mathbf{Var}[y|x]$.
2. Holding the class-conditional parameters ($\theta_i$) fixed, derive a stochastic (i.e. for one data point) gradient ascent expression for the softmax weight parameters $\zeta_i$ using the maximum likelihood principle. (Hint: consider the M-step of the EM algorithm.)
3. Now devise a prediction mapping function $h:\mathcal{X}\rightarrow\mathcal{Y}$ defined as $h(x)=\sum_i^n \sigma_i(x)\mu_i(x)$, where generally $\sigma(\cdot)$ is an MLP and $\mu(\cdot)$ is a prediction function depending on the input $x$. Let $\sigma(x)$ be a softmax regression over $n$ classes and $\mu$ be a set of $n$ linear mapping functions, i.e. $h(x) = \sum_i^n \{s_i(x;\zeta) (\sum_j^p x_j\beta_{ij}) \}$. If we want to minimise the quadratic loss $l(y_n,x_n) = (y_n - h(x_n))^2$ for each data point $n$, what is the gradient descent update expression for the parameter $\zeta_i$ if $\beta$ is fixed?
4. Comment on the difference between the previous two training objectives.
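For part 1, the closed forms follow from the laws of total expectation and total variance, and can be checked by Monte Carlo. A sketch with hypothetical parameter values (NumPy assumed; the logits, means, and variances are arbitrary stand-ins for $s_i(x;\zeta)$, $x^T\beta_i$, and $\sigma_i^2$ at a fixed $x$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture at a fixed x: softmax weights, Gaussian components
logits = np.array([0.2, -0.5, 1.0])
pi = np.exp(logits) / np.exp(logits).sum()   # p(c=i|x)
mu = np.array([-1.0, 0.5, 2.0])              # component means x^T beta_i
sigma2 = np.array([0.3, 1.0, 0.5])           # component variances sigma_i^2

# Closed forms from the laws of total expectation / total variance
mean = pi @ mu
var = pi @ (sigma2 + mu**2) - mean**2

# Monte-Carlo check: sample a component, then sample y from it
c = rng.choice(3, size=500_000, p=pi)
y = rng.normal(mu[c], np.sqrt(sigma2[c]))
assert abs(y.mean() - mean) < 0.01
assert abs(y.var() - var) < 0.02
```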

# Q5 – Softmax and Sigmoid

Contributed by Antoine Lefebvre-Brossard

1. Show that a 2-class softmax function can be rewritten as a sigmoid function. In
other words, show that the first component of $\text{softmax}(\boldsymbol{x})$ equals $\sigma(z)$ for some scalar $z$, where the softmax function is defined by $\text{softmax}(\boldsymbol{x}) = (\frac{e^{x_1}}{e^{x_1}+e^{x_2}}, \frac{e^{x_2}}{e^{x_1}+e^{x_2}})^T$ and the sigmoid function is defined by $\sigma(z) = \frac{1}{1 + e^{-z}}$.
2. Use the previous result to show that it’s possible to write a $k$-class softmax
function as a function of $k-1$ variables.
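Both identities are easy to check numerically before proving them. A minimal NumPy sketch (note it reveals the choice $z = x_1 - x_2$; the test vectors are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Part 1: for two classes, the first softmax component is a sigmoid of x1 - x2
x = np.array([0.7, -1.3])
assert np.isclose(softmax(x)[0], sigmoid(x[0] - x[1]))
assert np.isclose(softmax(x)[1], 1.0 - sigmoid(x[0] - x[1]))

# Part 2: softmax is invariant to adding a constant to every input, so
# subtracting x_k leaves a function of only k-1 free differences
v = np.array([0.3, -0.8, 1.1, 0.4])
assert np.allclose(softmax(v), softmax(v - v[-1]))
```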

# Q4 – Capacity vs. early stopping

Contributed by Stéphanie Larocque

Why does the effective capacity of a model increase as the number of training iterations increases? Explain the intuition behind the link between early stopping and $L_2$ weight decay for a linear model.
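The intuition can be explored empirically: on a linear least-squares problem, the iterates of gradient descent started from zero trace out solutions resembling ridge regression with a shrinking penalty. A sketch under assumed settings (NumPy; the data, learning rate, and the penalty value 2000 are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(200)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # unregularized least-squares solution

def ridge(lam):
    # L2-penalized (weight-decayed) solution
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient descent on the *unregularized* squared loss, starting from 0
lr, w = 1e-3, np.zeros(5)
norms = []
for t in range(1, 2001):
    w = w - lr * X.T @ (X @ w - y) / len(y)
    norms.append(np.linalg.norm(w))
    if t == 100:
        w_early = w.copy()

# The weight norm grows monotonically toward the unregularized solution:
# stopping earlier acts like a stronger L2 penalty (smaller weights)
assert norms[9] < norms[499] < np.linalg.norm(w_ols)

# An early iterate is much closer to a strongly regularized ridge solution
# than to the unregularized one
assert np.linalg.norm(w_early - ridge(2000.0)) < np.linalg.norm(w_early - w_ols)
```

This mirrors the textbook argument: in the eigenbasis of $X^TX$, both early stopping and weight decay shrink each coordinate toward zero, with the number of iterations playing the role of an inverse penalty strength.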