Q2 – Basics of Neural Networks

Contributed by Philippe Lacaille.

Consider a training dataset $\mathit{D}=\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)}) \}$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{ 0, 1 \}$. Let’s further consider a standard feedforward network with one hidden layer $h$. For the rest of the exercise, we denote by $h^a$ the pre-activation of the hidden layer and by $h^s$ its post-activation.

1. Assuming the network’s goal is binary classification (with the structure detailed above), what would be an appropriate:
    1. Activation function for the output layer $f(x)$? What does the output represent under this activation function?
    2. Loss function, represented by $L(f(x, \theta), y)$, to be used in training? (You can write it as a function of $f(x, \theta)$.)
2. Now let’s assume the following about the network, while still using the same dataset:
    • The network’s goal is now multi-class classification, $y^{(i)} \in \{ 1, 2, 3 \}$.
    • The set of parameters is $\theta = \{ W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)} \}$.
    • $x^{(i)}$ has 10 features and there are 15 units in the hidden layer $h$.
    1. What would be an appropriate activation function for the output layer $f(x)$ under this structure? What does the output vector represent under this activation function?
    2. What are the dimensions of each of the parameters in $\theta$?
    3. Using gradient descent, what is the update rule for the parameters $\theta$ (no regularization)? You can write it as a function of $L(f(x, \theta), y)$.
        1. If using batch gradient descent?
        2. If using mini-batch (size $m$) gradient descent?
        3. If using stochastic gradient descent?
3. In a classification problem, why isn’t the network trained to directly minimize the classification error?

5 thoughts on “Q2 – Basics of Neural Networks”

1.
a) For binary classification, we should use a sigmoid activation so that the output lies between 0 and 1. It can then be interpreted as the posterior probability of the positive class, $p(y=1 \mid x)$.

b) BCE (binary cross-entropy) would be an appropriate loss.
$L(f(x,\theta),y) = -y \log f(x,\theta) - (1-y) \log(1 - f(x,\theta))$
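To make the sigmoid and BCE formulas concrete, here is a minimal sketch in plain Python. The function names, the clipping constant `eps`, and the pre-activation value 2.0 are assumptions chosen for illustration; they are not part of the exercise.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: maps a real-valued pre-activation to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(p, y, eps=1e-12):
    """BCE for one example, where p = f(x, theta) estimates p(y=1|x) and y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumed constant
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

p = sigmoid(2.0)                       # hypothetical output pre-activation
loss_pos = binary_cross_entropy(p, 1)  # small: the model favours y = 1
loss_neg = binary_cross_entropy(p, 0)  # large: the model is confidently wrong
```

The loss is low when the predicted probability agrees with the label and grows without bound as the prediction becomes confidently wrong.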

2.
a) For multi-class classification, we should use a softmax activation. The output is a vector of non-negative values that sum to 1, and its component $c$ can be interpreted as the posterior probability of class $c$, $p(y=c \mid x)$.
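As a quick illustration of the softmax output, here is a small sketch in plain Python. The logit values are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something the exercise requires.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to a probability vector."""
    m = max(logits)                           # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                # hypothetical output pre-activations
predicted_class = probs.index(max(probs)) + 1   # classes are labelled 1..3 in the exercise
```

Each entry of `probs` plays the role of $p(y=c \mid x)$ for one class, and the entries sum to 1.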

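b) Using the row-vector convention $h^a = x W^{(1)} + b^{(1)}$ (the weight shapes are transposed under the convention $h^a = W^{(1)} x + b^{(1)}$):
• $W^{(1)}$ has dimension (10, 15)
• $b^{(1)}$ has dimension (15)
• $W^{(2)}$ has dimension (15, 3)
• $b^{(2)}$ has dimension (3)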
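The shapes above can be checked with a small forward pass. This is only a sketch in plain Python: it assumes the row-vector convention (input times weight matrix), a tanh hidden activation (the exercise does not specify one), and a made-up Gaussian initialisation. The names `init_params`, `affine`, `forward`, and `shape` are hypothetical helpers.

```python
import math
import random

D_IN, D_HIDDEN, D_OUT = 10, 15, 3  # sizes given in the exercise

def init_params(seed=0):
    """Build theta = {W1, b1, W2, b2} with small random weights (assumed init)."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W1": matrix(D_IN, D_HIDDEN), "b1": [0.0] * D_HIDDEN,
            "W2": matrix(D_HIDDEN, D_OUT), "b2": [0.0] * D_OUT}

def affine(x, W, b):
    """Row-vector convention: returns x W + b, with W of shape (len(x), len(b))."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def forward(x, params):
    h_a = affine(x, params["W1"], params["b1"])    # pre-activation, 15 values
    h_s = [math.tanh(a) for a in h_a]              # post-activation (tanh is an assumption)
    o_a = affine(h_s, params["W2"], params["b2"])  # output pre-activation, 3 values
    m = max(o_a)
    exps = [math.exp(a - m) for a in o_a]
    total = sum(exps)
    return [e / total for e in exps]               # softmax over the 3 classes

def shape(p):
    """Shape of a nested-list matrix or a flat vector."""
    return (len(p), len(p[0])) if isinstance(p[0], list) else (len(p),)

params = init_params()
shapes = {name: shape(p) for name, p in params.items()}
n_params = sum(math.prod(s) for s in shapes.values())  # 150 + 15 + 45 + 3
output = forward([1.0] * D_IN, params)
```

Counting entries this way gives 213 trainable parameters for this architecture.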

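c) $\theta \leftarrow \theta - \alpha \Delta$, where $\alpha$ is the learning rate and $\Delta$ is the gradient estimate:
For batch gradient descent, $\Delta = \frac{1}{n} \sum_{i=1}^n \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)})$, averaged over the full training set.
For mini-batch gradient descent, $\Delta = \frac{1}{m} \sum_{i \in B} \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)})$, where $B$ is a mini-batch of $m$ examples sampled from $D$.
For stochastic gradient descent, $\Delta = \nabla_{\theta} L(f(x^{(i)}, \theta), y^{(i)})$ for a single example $i$ sampled from $D$.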
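The three variants differ only in how many examples are averaged per update. Below is a minimal sketch in plain Python, assuming a toy 1-D logistic model with BCE loss and a made-up four-point dataset; `grad_example`, `mean_gradient`, and `gd_step` are hypothetical helper names.

```python
import math
import random

def grad_example(theta, x, y):
    """Gradient of the BCE loss for the 1-D logistic model f(x) = sigmoid(w*x + b)."""
    w, b = theta
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return [(p - y) * x, (p - y)]  # [dL/dw, dL/db]

def mean_gradient(theta, batch):
    """Average the per-example gradients over the batch (this is Delta)."""
    grads = [grad_example(theta, x, y) for x, y in batch]
    return [sum(g[k] for g in grads) / len(batch) for k in range(len(theta))]

def gd_step(theta, data, lr, batch_size, rng):
    """One update theta <- theta - lr * Delta.

    batch_size == len(data) gives batch GD, batch_size == 1 gives SGD,
    and anything in between gives mini-batch GD.
    """
    batch = data if batch_size >= len(data) else rng.sample(data, batch_size)
    delta = mean_gradient(theta, batch)
    return [t - lr * d for t, d in zip(theta, delta)]

rng = random.Random(0)
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # toy, made-up dataset

theta_batch = gd_step([0.0, 0.0], data, lr=0.1, batch_size=len(data), rng=rng)
theta_mini = gd_step([0.0, 0.0], data, lr=0.1, batch_size=2, rng=rng)
theta_sgd = gd_step([0.0, 0.0], data, lr=0.1, batch_size=1, rng=rng)
```

All three calls move the weight in the same direction on this toy data. The batch update is deterministic, while the mini-batch and stochastic updates depend on which examples were sampled.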

3. Minimizing classification error would be ideal, but the function is non-smooth, so we need to use a surrogate loss instead.


• Stéphanie Larocque says:

3. Furthermore, the classification error gives no usable gradient to follow.


2. In my answer to item 3, I just said that the classification error is not a differentiable function.


• Stéphanie Larocque says:

The classification error is a sum of 0/1 losses, one per data point, so it’s differentiable everywhere except at the points where a loss jumps from 0 to 1. So I don’t think this is the main problem (as with ReLU, if a function is non-differentiable only on a negligible, measure-zero set of points, that’s not “really” a problem since we most likely won’t land on those points). I think the real problem is that the gradient is 0 everywhere (except where the jump happens), so we can’t learn. Do you think that’s correct?
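The zero-gradient argument can be checked numerically. Here is a minimal sketch in plain Python, assuming a made-up 1-D dataset, a one-weight logistic model, and a central finite difference as a stand-in for the gradient; all names are hypothetical.

```python
import math

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # toy, made-up dataset

def predict_proba(w, x):
    """One-weight logistic model: p(y=1|x) = sigmoid(w * x)."""
    return 1.0 / (1.0 + math.exp(-w * x))

def classification_error(w):
    """Mean 0/1 loss: fraction of examples whose thresholded prediction is wrong."""
    return sum(int((predict_proba(w, x) > 0.5) != bool(y)) for x, y in data) / len(data)

def cross_entropy(w):
    """Mean binary cross-entropy, the usual smooth surrogate loss."""
    total = 0.0
    for x, y in data:
        p = predict_proba(w, x)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(data)

def finite_difference(f, w, h=1e-4):
    """Central-difference estimate of df/dw."""
    return (f(w + h) - f(w - h)) / (2 * h)

w = -0.5  # a bad weight: every example is misclassified
err_slope = finite_difference(classification_error, w)  # flat: no signal to follow
ce_slope = finite_difference(cross_entropy, w)          # non-zero: points toward better w
```

Even though every example is misclassified at this weight, the classification error is locally flat, so its slope gives no direction for improvement. The cross-entropy has a negative slope there, which tells gradient descent to increase `w`.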


• I think so! I should have mentioned this aspect in my answer. Thanks.
