Q8 – Convolution and Fourier Transform

Contributed by Louis-Guillaume Gagnon


  1. Consider a convolution operation over continuous functions: s(t) = \int_{-\infty}^\infty x(a)w(t - a)\,da. Using the definition of the Fourier transform, F(k) = \int_{-\infty}^\infty f(x)e^{-2i\pi k x}dx, and its inverse, f(x) = \int_{-\infty}^\infty F(k)e^{2i\pi k x}dk, show that taking the inverse transform of the product of two Fourier-transformed functions yields the convolution of those functions in their original domain.
  2. For discrete functions over a finite domain, what is the time complexity of the direct, time-domain convolution operation? What is the time complexity of a convolution implemented via the Fourier transform?
  3. Using the programming language of your choice (e.g. Python with NumPy), empirically verify your results from part 2.
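For part 3, one possible sketch in Python with NumPy: compute the same convolution directly in the time domain and via the FFT (zero-padding to the full output length), and confirm the two agree. Timing the two approaches for growing n would then expose the O(n^2) vs O(n log n) gap from part 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
w = rng.standard_normal(256)

# Direct time-domain (full) convolution: O(n^2) for two length-n signals.
direct = np.convolve(x, w, mode="full")

# Fourier-domain convolution: zero-pad both signals to the full output
# length n_out, multiply their FFTs, and invert. Each FFT is O(n log n).
n_out = len(x) + len(w) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n_out) * np.fft.rfft(w, n_out), n_out)

print(np.allclose(direct, via_fft))
```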


Q7 – Jacobian of Softmax and Sigmoid

Contributed by Louis-Guillaume Gagnon


  1. Compute the Jacobian matrix of the softmax function, S(x)_i = \frac{e^{x_i}}{\sum_k e^{x_k}}. Express it as a matrix equation.
  2. Compute the Jacobian matrix of the sigmoid function, \sigma(x) = 1/(1 + e^{-x}).
  3. Let y and x be vectors related by y = f(x), let L be an unspecified loss function, and let g_x and g_y be the gradients of L with respect to x and y. Let J_y(x) be the Jacobian of y with respect to x. Eq. 6.46 of the Deep Learning book tells us that g_x = J_y(x) \cdot g_y. Show that if f(x) \equiv \sigma(x) (applied element-wise), the above can be rewritten as g_x = g_y \odot \sigma'(x). If f(x) \equiv S(x), can g_x be written the same way?
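A numerical sanity check for part 1 can be sketched as follows: assuming the standard closed form J = diag(S(x)) - S(x)S(x)^T for the softmax Jacobian, compare it against a central finite-difference estimate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
s = softmax(x)

# Assumed closed form for part 1: J = diag(s) - s s^T.
J_analytic = np.diag(s) - np.outer(s, s)

# Central finite-difference estimate of the Jacobian, column by column.
eps = 1e-6
J_numeric = np.empty((3, 3))
for j in range(3):
    dx = np.zeros(3)
    dx[j] = eps
    J_numeric[:, j] = (softmax(x + dx) - softmax(x - dx)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))
```

The same finite-difference loop applied to the element-wise sigmoid would show a diagonal Jacobian, which is why part 3's rewrite with \odot works for \sigma but not for S.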

Q6 – Mixture Density Network

Contributed by Chin-Wei Huang.


Consider a univariate-output Mixture Density Network, with n specifying the number of mixture components: p(y|x) = \sum_{i=1}^n p(c=i|x)p(y|x;\theta_i), where \theta=\{\theta_i\}_{i=1:n} is the set of class-conditional parameters.

In the following questions, assume the “prior” probability distributions p(c=i|x) form a multinoulli distribution, parameterized by a softmax applied to linear functions of the input (i.e. softmax regression): p(c=i|x;\zeta) = s_i(x;\zeta) = \frac{\exp{(\zeta_i^T{x})} }{{\sum_{i'}{\exp{(\zeta_{i'}^T{x})}}}}.

  1. Suppose Y is continuous, and let p(y|x;\theta_i) = \mathcal{N}(y;x^T\beta_i,\sigma^2_i). To do prediction, use the conditional expectation as a point estimate of the output. Derive \mathbf{E}[y|x] and \mathbf{Var}[y|x].
  2. Holding the class-conditional parameters (\theta_i) fixed, derive a stochastic (i.e. for one data point) gradient ascent expression for the softmax weight parameters \zeta_i using the maximum likelihood principle. (Hint: consider the M-step of the EM algorithm.)
  3. Now devise a prediction function h:\mathcal{X}\rightarrow\mathcal{Y} defined as h(x)=\sum_{i=1}^n s_i(x)\mu_i(x), where in general s(\cdot) is an MLP and \mu(\cdot) is a prediction function depending on the input x. Here, let s(x) be a softmax regression over n classes and let \mu be a set of n linear mappings, i.e. h(x) = \sum_{i=1}^n s_i(x;\zeta) \left(\sum_{j=1}^p x_j\beta_{ij}\right). If we want to minimise the quadratic loss l(y_n,x_n) = (y_n - h(x_n))^2 for each data point n, what is the gradient descent update expression for the parameter \zeta_i if \beta is held fixed?
  4. Comment on the difference between the previous two training objectives.
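For part 1, the laws of total expectation and total variance give \mathbf{E}[y|x] = \sum_i s_i \mu_i and \mathbf{Var}[y|x] = \sum_i s_i(\sigma_i^2 + \mu_i^2) - \mathbf{E}[y|x]^2, with \mu_i = x^T\beta_i. A minimal Monte Carlo sketch of these formulas, using hypothetical mixture parameters for a single fixed x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters for one fixed x (hypothetical values).
s = np.array([0.2, 0.5, 0.3])        # p(c=i|x), the softmax "prior"
mu = np.array([-1.0, 0.0, 2.0])      # component means x^T beta_i
sigma2 = np.array([0.5, 1.0, 0.25])  # component variances sigma_i^2

# Closed forms via the laws of total expectation / total variance:
#   E[y|x]   = sum_i s_i mu_i
#   Var[y|x] = sum_i s_i (sigma_i^2 + mu_i^2) - E[y|x]^2
mean = s @ mu
var = s @ (sigma2 + mu**2) - mean**2

# Monte Carlo check: sample a component index, then a Gaussian output.
c = rng.choice(3, size=200_000, p=s)
y = rng.normal(mu[c], np.sqrt(sigma2[c]))
print(abs(y.mean() - mean) < 0.05, abs(y.var() - var) < 0.1)
```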