Contributed by Philippe Lacaille.
Consider a given training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)}$ is the corresponding target. Let's further consider a standard feedforward network with a single hidden layer $h$. For the rest of the exercise, we denote $a = W^{(1)} x + b^{(1)}$ as the pre-activated layer and $h = g(a)$ as the post-activated layer, for some activation function $g$; the output layer is denoted $\hat{y}$.
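For reference, a minimal NumPy sketch of this forward pass, following the notation above (the choice of $g$ here is an arbitrary placeholder, and the output activation is deliberately left out since choosing it is part of the exercise):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, g=np.tanh):
    """Forward pass of the one-hidden-layer network described above."""
    a = W1 @ x + b1    # pre-activated hidden layer
    h = g(a)           # post-activated hidden layer
    out = W2 @ h + b2  # output pre-activation (logits), before the output activation
    return a, h, out
```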
- Assuming the network's goal is to do binary classification (with the structure detailed above), what would be an appropriate:
  - Activation function for the output layer $\hat{y}$? What does the output represent under this activation function?
  - Loss function, represented by $L(\hat{y}, y)$, that will be used in training? (You can write it as a function of $\hat{y}$ and $y$; one standard pairing is sketched below for reference.)
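As a reference point (a sketch of one standard pairing, not the only possible answer): a sigmoid output with the binary cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    # Squashes the output logit into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # L(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```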
- Now let's assume the following about the network, while still using the same dataset:
  - The network's goal is now to do multi-class classification
  - The set of parameters is $\theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$
  - $x$ has 10 features and there are 15 units at the hidden layer $h$
- What would be an appropriate activation function for the output layer $\hat{y}$ under this structure? What does the output vector represent under this activation function?
- What are the dimensions of each of the parameters in $\theta$? (A shape-checking sketch follows.)
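To make the dimension question concrete, a short sketch that instantiates $\theta$ with the sizes above. The exercise leaves the number of classes open, so `n_classes` below is an assumed placeholder:

```python
import numpy as np

d, n_hidden = 10, 15
n_classes = 4  # assumed placeholder; the exercise does not fix the class count

rng = np.random.default_rng(0)
theta = {
    "W1": rng.normal(size=(n_hidden, d)),          # hidden-layer weights
    "b1": np.zeros(n_hidden),                      # hidden-layer biases
    "W2": rng.normal(size=(n_classes, n_hidden)),  # output-layer weights
    "b2": np.zeros(n_classes),                     # output-layer biases
}
for name, p in theta.items():
    print(name, p.shape)  # W1 (15, 10), b1 (15,), W2 (4, 15), b2 (4,)
```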
- Using gradient descent, what is the update function for the parameters $\theta$ (no regularization)? You can write it as a function of the gradient $\nabla_\theta L$ and the learning rate $\eta$. In particular (the three variants are sketched after this list), what is the update:
  - if using batch gradient descent?
  - if using mini-batch (size $K$) gradient descent?
  - if using stochastic gradient descent?
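A minimal sketch of how the three variants differ only in which examples enter the gradient estimate. Here `grad_fn` is a hypothetical helper that returns $\nabla_\theta L$ averaged over the given batch (a dict keyed like `theta`), and `X`, `Y` are arrays of the $n$ training examples:

```python
import numpy as np

def gd_step(theta, grad_fn, X, Y, lr=0.1, batch_size=None, rng=None):
    """One gradient-descent update on the parameters.

    batch_size=None -> batch GD (all n examples)
    batch_size=K    -> mini-batch GD (K sampled examples)
    batch_size=1    -> stochastic GD (a single example)
    """
    n = len(X)
    if batch_size is None:
        idx = np.arange(n)  # full batch
    else:
        rng = rng or np.random.default_rng()
        idx = rng.choice(n, size=batch_size, replace=False)  # sampled batch
    grads = grad_fn(theta, X[idx], Y[idx])
    # theta <- theta - lr * grad, parameter by parameter
    return {k: theta[k] - lr * grads[k] for k in theta}
```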
- For a classification problem, why isn't the network trained to directly minimize the classification error?
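One way to see the issue: the 0-1 classification error is piecewise constant in the parameters, so its gradient is zero almost everywhere and gives gradient descent no signal, whereas a smooth surrogate such as the cross-entropy loss decreases continuously. A small illustration on a single hypothetical example:

```python
import numpy as np

# Sweep a single weight: a small nudge almost never flips a hard
# prediction, so the 0-1 error is a flat staircase in the parameter.
w = np.linspace(-2, 2, 9)                 # candidate weight values
x, y = 1.0, 1                             # one hypothetical example
y_hat = (w * x > 0).astype(int)           # hard 0/1 prediction
zero_one = (y_hat != y).astype(float)     # classification error
logistic = np.log1p(np.exp(-(2 * y - 1) * w * x))  # smooth surrogate

print(zero_one)  # [1 1 1 1 1 0 0 0 0]: flat steps, zero gradient almost everywhere
print(logistic)  # strictly decreasing: an informative gradient everywhere
```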