Q2 – Basics of Neural Networks

Contributed by Philippe Lacaille.

Consider a training dataset \mathit{D}=\{(x^{(1)}, y^{(1)}),  (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)}) \}, where x^{(i)} \in  \mathbb{R}^d and y^{(i)} \in \{ 0, 1 \}. Let’s further consider a
standard feedforward network with a hidden layer h. For the rest of the exercise, we denote by h^a the pre-activation of the hidden layer and by h^s its post-activation.

  1.  Assuming the network’s goal is to do binary classification (with the
    detailed structure above), what would be an appropriate:

    1. Activation function for the output layer f(x)? What does the output represent under this activation function?
    2. Loss function, represented by L(f(x, \theta), y), that will be used in training? (You can write it as a function of f(x, \theta)).
  2. Now let’s assume the following about the network, while still using the same dataset:
    • Network goal is now to do multi-class classification y^{(i)} \in \{ 1, 2, 3 \}
    • The set of parameters \theta = \{ W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)} \}
    • x^{(i)} has 10 features and there are 15 units at the hidden layer h.
    1. What would be an appropriate activation function for the output layer f(x) under this structure? What does the output vector represent under this activation function?
    2. What are the dimensions of each of the parameters in \theta?
    3. Using gradient descent, what is the update function for the parameters \theta (no regularization)? You can write it as a function of L(f(x, \theta), y).
      1. if using batch gradient descent?
      2. if using mini-batch (size m) gradient descent?
      3. if using stochastic gradient descent?
  3. Under a classification problem, why isn’t the network trained to directly minimize the classification error?

5 thoughts on “Q2 – Basics of Neural Networks”

  1. Answer – Q2:

    a) For binary classification, we should use a sigmoid activation, so that the output lies between 0 and 1. It can then be interpreted as the posterior probability of the positive class, p(y=1|x).

    b) BCE (binary cross-entropy) would be an appropriate loss:
    L(f(x,\theta),y) = -y \log f(x,\theta) - (1-y) \log(1 - f(x,\theta))
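As a minimal sketch of the two answers above, the sigmoid output and the binary cross-entropy can be written in plain Python. The function names and the pre-activation value 2.0 are illustrative choices, not part of the exercise.

```python
import math

def sigmoid(a):
    """Map a real-valued pre-activation to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(p, y):
    """L(f(x, theta), y) with p = f(x, theta) in (0, 1) and y in {0, 1}."""
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

# Hypothetical output pre-activation for one example.
p = sigmoid(2.0)                          # read as p(y=1|x)
loss_if_y1 = binary_cross_entropy(p, 1)   # small: prediction agrees with the label
loss_if_y0 = binary_cross_entropy(p, 0)   # large: prediction disagrees with the label
```

The loss is small when the predicted probability agrees with the label and grows as the prediction becomes confidently wrong.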

    a) For multiclass classification, we should use a softmax activation. The output is a vector whose components are non-negative and sum to 1; component c can be interpreted as the posterior probability of class c, p(y=c|x).

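    b) With the convention h^a = x W^{(1)} + b^{(1)} (x as a row vector), the dimensions are:
    • W^{(1)} is of dimension (10, 15)
    • b^{(1)} is of dimension (15)
    • W^{(2)} is of dimension (15, 3)
    • b^{(2)} is of dimension (3)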
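The shapes listed above can be checked with a small forward pass. This is only a sketch in dependency-free Python: the row-vector convention x W + b, the tanh hidden activation, the random initialization and all helper names are assumptions made for illustration.

```python
import math
import random

d, n_hidden, n_classes = 10, 15, 3   # sizes given in the exercise
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Matrix stored as a list of rows, filled with small random values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W1, b1 = rand_matrix(d, n_hidden), [0.0] * n_hidden                # (10, 15), (15)
W2, b2 = rand_matrix(n_hidden, n_classes), [0.0] * n_classes       # (15, 3), (3)

def affine(x, W, b):
    """Row-vector convention: returns x W + b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(a):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x):
    h_a = affine(x, W1, b1)               # pre-activation, 15 values
    h_s = [math.tanh(v) for v in h_a]     # post-activation (tanh is an assumption)
    return softmax(affine(h_s, W2, b2))   # 3 class probabilities

x = [rng.random() for _ in range(d)]
probs = forward(x)
n_params = d * n_hidden + n_hidden + n_hidden * n_classes + n_classes
```

Counting entries gives 150 + 15 + 45 + 3 = 213 parameters in total.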

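    c) \theta \leftarrow \theta - \alpha \Delta, where \alpha is the learning rate and \Delta depends on the variant:
    For batch gradient descent, \Delta = \frac{1}{n} \sum_{i=1}^n \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)})
    For mini-batch gradient descent, \Delta = \frac{1}{m} \sum_{i \in B} \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)}), where B is a mini-batch of m examples sampled from D
    For stochastic gradient descent, \Delta = \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)}) for a single example i sampled from D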
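The three variants differ only in which examples the gradient is averaged over, which a short sketch can make concrete. To keep it self-contained, a logistic model with binary cross-entropy stands in for the network; the toy data, learning rate and function names are all assumptions.

```python
import math
import random

def grad_example(theta, x, y):
    """Per-example gradient of the binary cross-entropy for a logistic model
    (a stand-in for the network's gradient): (p - y) * x."""
    p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    return [(p - y) * xi for xi in x]

def step(theta, examples, alpha):
    """theta <- theta - alpha * (average gradient over `examples`)."""
    delta = [0.0] * len(theta)
    for x, y in examples:
        g = grad_example(theta, x, y)
        delta = [dj + gj / len(examples) for dj, gj in zip(delta, g)]
    return [t - alpha * dj for t, dj in zip(theta, delta)]

# Toy dataset: a bias feature plus one input; the label is 1 when the input is positive.
rng = random.Random(0)
data = []
for _ in range(20):
    x1 = rng.uniform(-1.0, 1.0)
    data.append(([1.0, x1], 1 if x1 > 0 else 0))

theta = [0.0, 0.0]
alpha, m = 0.5, 4

theta_batch = step(theta, data, alpha)                 # batch: all n examples
theta_mini = step(theta, rng.sample(data, m), alpha)   # mini-batch: m sampled examples
theta_sgd = step(theta, [rng.choice(data)], alpha)     # stochastic: one sampled example
```

In practice each call would be repeated many times, drawing a fresh mini-batch or example at every step.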

    3. Minimizing the classification error directly would be ideal, but the 0/1 error is non-smooth, so we need to train on a differentiable surrogate loss, such as cross-entropy, instead.


    • Classification error is a sum of 0/1 losses over the data points, so it is differentiable everywhere except at the points where the loss jumps from 0 to 1. So I think this is not the main problem (as with ReLU, if the function is non-differentiable at only a countable number of points, that’s not “really” a problem, since we most likely won’t land on those points). I think the problem is rather that the gradient is 0 everywhere (except where the loss jumps from 0 to 1), so we can’t learn. Do you think that’s correct?
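The zero-gradient argument in this reply can be illustrated numerically. The sketch below compares finite-difference slopes of the 0/1 error and of the cross-entropy for a one-parameter logistic classifier; the toy data, the weight value and the helper names are made up for illustration.

```python
import math

# Toy (x, y) pairs; the point (1.5, 0) is misclassified for any positive weight.
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 0), (3.0, 1)]

def predict_proba(w, x):
    """Logistic model with a single weight and no bias."""
    return 1.0 / (1.0 + math.exp(-w * x))

def zero_one_error(w):
    """Fraction of examples whose thresholded prediction differs from the label."""
    return sum(int((predict_proba(w, x) > 0.5) != bool(y)) for x, y in data) / len(data)

def cross_entropy(w):
    """Average binary cross-entropy over the toy data."""
    return sum(-y * math.log(predict_proba(w, x))
               - (1 - y) * math.log(1.0 - predict_proba(w, x))
               for x, y in data) / len(data)

def slope(f, w, eps=1e-4):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.3
slope_error = slope(zero_one_error, w)   # flat: the error does not change near w
slope_ce = slope(cross_entropy, w)       # non-zero: indicates which way to move w
```

Near w = 0.3 a small change in w flips no prediction, so the 0/1 error gives no signal, while the cross-entropy slope is negative and points toward a larger w.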

