02 – Introduction to Neural Nets

In this lecture we finish our overview of Machine Learning and begin our fairly detailed introduction to Neural Networks.

Lecture 02 artificial neurons (slides modified from Hugo Larochelle’s course notes)

Reference: (you are responsible for all of this material)

  • Chapter 6 of the Deep Learning textbook (by Ian Goodfellow, Yoshua Bengio and Aaron Courville).

6 thoughts on “02 – Introduction to Neural Nets”

  1. I find it interesting that there is a proof that a feed-forward neural network (FFNN) with a single hidden layer, given enough neurons in that layer, can approximate any continuous function arbitrarily well (Hornik, 1991).

    Why do we prefer “growing” the network in depth (adding hidden layers) rather than in width (adding neurons to the single hidden layer)? Is there a similar proof for an FFNN that is given “enough hidden layers”? Is there an equivalent “deep FFNN” for the function approximated by a particular single-hidden-layer FFNN?


    • Depth allows composition of features, i.e. neurons in upper layers can make use of functions computed by neurons in lower layers. Growing neural nets in depth is much more efficient than growing them in width, in the sense that you need dramatically fewer neurons to get the same level of approximation. I believe the best reference for this is “On the Expressive Power of Deep Architectures” (Bengio & Delalleau, 2011) available at http://www.iro.umontreal.ca/~lisa/bib/pub_subject/finance/pointeurs/ALT2011.pdf.

      Your second and third questions are answered, if trivially, by considering a single-layer neural network an instance of a deep neural network.


  2. Hi,

    I have a question concerning heteroskedastic data. The book talks a lot about deriving cost functions from maximum likelihood. When doing regression, we often use MSE to evaluate performance. However, MSE is derived under a constant-variance assumption [1]. My questions are: what is the expected behavior of a NN fitted on heteroskedastic data? Also, how would you define a loss function that would circumvent this problem?


    [1] http://www.cs.mcgill.ca/~dprecup/courses/ML/Lectures/ml-lecture01.pdf, slides 33-34-35
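
One possible way to approach the second question, offered as a sketch and not as the course's answer: keep the maximum-likelihood derivation but let the variance depend on the input, so that the network outputs both a mean and a log-variance for each example. The function names below (`gaussian_nll`, `mean_nll`, `mse`) are hypothetical, and the Gaussian noise model is an assumption. With the log-variance fixed at 0 the loss reduces to half the MSE plus a constant, which is the constant-variance case the comment refers to.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var)).

    Predicting log_var instead of the variance keeps the variance positive.
    """
    return 0.5 * (log_var + (y - mean) ** 2 / math.exp(log_var)
                  + math.log(2.0 * math.pi))

def mean_nll(ys, means, log_vars):
    """Average heteroskedastic Gaussian NLL over a dataset."""
    total = sum(gaussian_nll(y, m, s) for y, m, s in zip(ys, means, log_vars))
    return total / len(ys)

def mse(ys, means):
    """Ordinary mean squared error, for comparison."""
    return sum((y - m) ** 2 for y, m in zip(ys, means)) / len(ys)

# Toy targets and predicted means (made-up numbers, for illustration only).
ys = [1.0, 2.5, -0.5, 4.0]
means = [0.8, 2.0, 0.0, 1.0]

# Constant unit variance: the NLL is 0.5 * MSE plus a constant.
const_nll = mean_nll(ys, means, [0.0] * len(ys))

# Per-example log-variance matched to each squared residual: the noisy
# last example is down-weighted and the average NLL drops.
matched = [math.log((y - m) ** 2) for y, m in zip(ys, means)]
hetero_nll = mean_nll(ys, means, matched)

print(const_nll, hetero_nll)
```

For a fixed residual, this loss is minimised when the predicted variance equals the squared residual, so examples in noisy regions contribute less to the gradient of the mean than they would under plain MSE.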


  3. It is mentioned on slide 38 that mini-batch training can give a more accurate estimate of the gradient. However, doesn’t the gradient estimate from SGD with a single sample have the same expected value as for any other batch size? By what measure of accuracy is this true? I would understand if it said that mini-batching lowers the variance of the gradient estimate; maybe that is what is meant?
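
The distinction raised in this comment can be checked numerically. The following is a minimal sketch under assumed conditions: a one-parameter linear model, squared-error loss, synthetic data, and mini-batches sampled with replacement. All names (`per_sample_grads`, `minibatch_estimates`) are hypothetical. It compares the mean and the variance of the gradient estimate for several batch sizes.

```python
import random
import statistics

def per_sample_grads(xs, ys, w):
    """Gradient of 0.5 * (w * x - y)**2 with respect to w, for each example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def minibatch_estimates(grads, batch_size, n_trials, rng):
    """Draw n_trials mini-batches (with replacement) and average each one."""
    return [statistics.fmean(rng.choices(grads, k=batch_size))
            for _ in range(n_trials)]

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

grads = per_sample_grads(xs, ys, w=0.0)
full_grad = statistics.fmean(grads)  # full-batch gradient

results = {}
for b in (1, 10, 100):
    est = minibatch_estimates(grads, b, 5000, rng)
    results[b] = (statistics.fmean(est), statistics.pvariance(est))
    print(f"batch={b:3d}  mean={results[b][0]:+.3f}  var={results[b][1]:.4f}")
print(f"full-batch gradient: {full_grad:+.3f}")
```

In this setup the mean of the estimate stays close to the full-batch gradient for every batch size, while its variance shrinks roughly in proportion to 1/B. That is consistent with reading “more accurate” on the slide as “lower variance”, although the slide itself would have to confirm the intended meaning.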


  4. I want to share here some thoughts.

    In our question-and-answer session with Aaron in the class before the exam, we talked about hard-threshold units as a kind of activation that is not used in neural networks, because they provide no gradient that can be used to learn.

    I mentioned privately to Aaron that there are ways to build neural networks with hard-threshold units, in particular MLPs for classification. The process is not one of “training” but rather of “building” the networks in a greedy, constructive way.

    Here is one example that is now old: http://citeseerx.ist.psu.edu/viewdoc/download?doi=

    Here is a more recent example https://www.researchgate.net/profile/Ubaldo_Garcia4/publication/220196841_Novel_linear_programming_approach_for_building_a_piecewise_nonlinear_binary_classifier_with_a_priori_accuracy/links/5720642a08aefa64889a93db.pdf

    In both papers, linear programming is used sequentially to add neurons to a hidden layer, which lets the practitioner generate an arbitrarily large hidden layer. The second paper also gives another formulation, based on mixed-integer linear programming, that yields better parameters in the final iterations, along with some strategies for coping with large databases.

    In this paper, another strategy is proposed to deal with large databases: https://link.springer.com/chapter/10.1007/978-3-319-08979-9_13

    I just wanted to share that there is a way to build hard-threshold networks. Not that it is still competitive; it is more of a curiosity… well, it might be competitive for particular types of tasks, who knows!
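
As a small illustration of the point from the Q&A session that hard-threshold units give no usable gradient, the sketch below compares finite-difference slopes of a step function and a sigmoid. The helper names are hypothetical, and a central difference stands in for the analytic derivative.

```python
import math

def step(z):
    """Hard-threshold activation: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Smooth alternative to the hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def numerical_grad(f, z, eps=1e-4):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

points = (-2.0, -0.5, 0.5, 2.0)
step_grads = [numerical_grad(step, z) for z in points]
sigmoid_grads = [numerical_grad(sigmoid, z) for z in points]

for z, gs, gg in zip(points, step_grads, sigmoid_grads):
    print(f"z={z:+.1f}  step slope={gs:.1f}  sigmoid slope={gg:.4f}")
```

Away from the threshold the step function's slope is exactly zero, so backpropagation passes no signal through it, whereas the sigmoid's slope is positive everywhere. This is why constructive approaches such as those in the papers above take a different route from gradient-based training.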

