Contributed by Philippe Lacaille.

Given a training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)}$ is the corresponding target, let's consider a

standard feedforward network with one hidden layer $h$. For the rest of the exercise, we denote $a$ as the pre-activated layer and $h$ as the post-activated layer.

- Assuming the network’s goal is to do binary classification (with the detailed structure above), what would be an appropriate:
    - Activation function for the output layer? What does the output represent under this activation function?
    - Loss function, represented by $L$, that will be used in training?
*(You can write it as a function of the network’s output.)*

- Now let’s assume the following about the network, while still using the same dataset:
    - The network’s goal is now to do multi-class classification (3 classes)
    - The set of parameters is $\theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$
    - $x$ has 10 features and there are 15 units at the hidden layer $h$

- What would be an appropriate activation function for the output layer under this structure? What does the output vector represent under this activation function?
- What are the dimensions of each of the parameters in $\theta$?
- Using gradient descent, what is the update rule for the parameters (no regularization)?
*(You can write it as a function of $\nabla_\theta L$.)*
    - if using batch gradient descent?
    - if using mini-batch (size $B$) gradient descent?
    - if using stochastic gradient descent?

- Under a classification problem, why isn’t the network trained to directly minimize the classification error?


Answer – Q2:

1.

a) For binary classification, we should use a sigmoid activation, so that the output lies between 0 and 1. It can be interpreted as the posterior probability $P(y = 1 \mid x)$ in the binary case.

b) Binary cross-entropy (BCE) would be an appropriate loss: $L(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$.
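A minimal NumPy sketch of this pairing (the variable names are illustrative, not from the exercise):

```python
import numpy as np

def sigmoid(a):
    # Maps any real pre-activation into (0, 1), so the output
    # can be read as P(y = 1 | x).
    return 1.0 / (1.0 + np.exp(-a))

def bce_loss(y_hat, y, eps=1e-12):
    # Binary cross-entropy, averaged over examples; eps guards log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y_hat = sigmoid(np.array([-2.0, 0.0, 3.0]))  # three toy pre-activations
y = np.array([0.0, 1.0, 1.0])                # toy binary labels
loss = bce_loss(y_hat, y)
```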

2.

a) For multi-class classification, we should use a softmax activation. The output vector can then be interpreted as a posterior distribution over the classes, with entry $k$ giving $P(y = k \mid x)$.
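A small sketch of a numerically stable softmax (an illustration; the specific inputs are made up):

```python
import numpy as np

def softmax(a):
    # Subtracting the max leaves the result unchanged but avoids
    # overflow in exp; the output is positive and sums to 1, so it
    # can be read as a distribution over classes.
    z = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
# p sums to 1, and larger pre-activations get larger probabilities
```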

b)

- $W^{(1)}$ is of dimension $(10, 15)$
- $b^{(1)}$ is of dimension $(15)$
- $W^{(2)}$ is of dimension $(15, 3)$
- $b^{(2)}$ is of dimension $(3)$
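These shapes can be sanity-checked with a forward pass. The sketch below assumes the row-vector convention $h = g(xW^{(1)} + b^{(1)})$, which matches the dimensions listed:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 10))                  # one example, 10 features
W1, b1 = rng.normal(size=(10, 15)), np.zeros(15)
W2, b2 = rng.normal(size=(15, 3)), np.zeros(3)

h = np.tanh(x @ W1 + b1)                      # hidden layer: shape (1, 15)
logits = h @ W2 + b2                          # output pre-activation: (1, 3)
```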

c)

For batch gradient descent, $\theta \leftarrow \theta - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L\big(f(x^{(i)}; \theta), y^{(i)}\big)$

For mini-batch gradient descent, $\theta \leftarrow \theta - \alpha \, \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta L\big(f(x^{(i)}; \theta), y^{(i)}\big)$, where $\mathcal{B}$ is a randomly sampled mini-batch of size $B$

For stochastic gradient descent, $\theta \leftarrow \theta - \alpha \, \nabla_\theta L\big(f(x^{(i)}; \theta), y^{(i)}\big)$ for a single randomly sampled example $i$
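The three variants differ only in how many examples feed each gradient estimate. A sketch on a toy one-parameter least-squares problem (the `sgd_step` helper and `grad` function are illustrative assumptions, not from the post):

```python
import numpy as np

def sgd_step(theta, grad, x, y, lr=0.1, batch_size=None):
    """One gradient-descent update. batch_size=None -> full batch,
    batch_size=1 -> stochastic, otherwise mini-batch."""
    n = len(x)
    if batch_size is None:
        idx = np.arange(n)  # batch GD: use every example
    else:
        idx = np.random.choice(n, size=batch_size, replace=False)
    return theta - lr * grad(theta, x[idx], y[idx])

# Toy problem: fit y = theta * x by minimizing mean squared error.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
grad = lambda th, xb, yb: np.mean(2 * (th * xb - yb) * xb)  # d(MSE)/d(theta)
theta = 0.0
for _ in range(100):
    theta = sgd_step(theta, grad, x, y, lr=0.05)  # full-batch updates
```

With `batch_size=1` the same helper performs stochastic updates, and any intermediate size gives mini-batch descent.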

3. Minimizing the classification error directly would be ideal, but the 0/1 loss is non-smooth (piecewise constant), so we minimize a smooth surrogate loss such as cross-entropy instead.
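A tiny numerical illustration of why (an assumed sketch, not from the original answer): finite-difference "gradients" of the 0/1 loss vanish almost everywhere, while a smooth surrogate such as cross-entropy still provides a direction to move in:

```python
import numpy as np

def zero_one_loss(score, y):
    # Predict class 1 when the score is positive; loss is 1 on a mistake.
    pred = 1.0 if score > 0 else 0.0
    return float(pred != y)

def bce(score, y):
    # Cross-entropy of a sigmoid output, as a function of the raw score.
    p = 1.0 / (1.0 + np.exp(-score))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Finite-difference gradients at a misclassified point (score < 0, y = 1).
s, y, eps = -1.0, 1.0, 1e-4
g01 = (zero_one_loss(s + eps, y) - zero_one_loss(s - eps, y)) / (2 * eps)
gbce = (bce(s + eps, y) - bce(s - eps, y)) / (2 * eps)
# g01 is 0.0 (flat region, no signal); gbce is negative, pointing
# toward increasing the score and correcting the mistake
```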


3. Furthermore, classification error provides no useful gradient.


In my answer to item 3 I just said that classification error is not a differentiable function.


Classification error is a sum of 0/1 losses over the data points, so it’s differentiable everywhere except at the points where the loss jumps from 0 to 1. So I think non-differentiability isn’t the main problem (as with ReLU, if a function is non-differentiable at only a countable number of points, that isn’t “really” a problem, since we most likely won’t land exactly on those points). I think the real problem is that the gradient is 0 everywhere (except at the jumps), so we can’t learn. Do you think that’s correct?


I think so! I should have mentioned this aspect in my answer. Thanks.
