12 / 13 – Regularization

In this lecture, we will have a rather detailed discussion of regularization methods and their interpretation.


Reference: (* = you are responsible for this material)


13 thoughts on “12 / 13 – Regularization”

  1. Hi,

    On page 223 of the Deep Learning book it says that “Also, regularizing the bias parameters can introduce a significant amount of underfitting”.
    I understand why we should not regularize the bias, but I don’t see why regularizing it would introduce a significant amount of underfitting.

    Thanks! 🙂


    • Think of the simplest linear regression problem. Let’s say we have a single input variable x and an output variable y, and we want to fit a line to the data. This would be given by the general equation y = w1*x + w0.

      Consider two cases in this scenario:
      1) w1 and w0 are both regularized
      2) Only w1 is regularized.

      At this point, think of w1 as the slope of the line and w0 as the y-intercept.

      In case 1, w1 and w0 are both regularized (i.e. limited). Whatever slope the line assumes, the y-intercept is also penalized, so it is pulled toward zero instead of being free to take the value that fits the data best.

      In case 2, if only w1 is regularized, just the slope is constrained. The y-intercept still has infinite possibilities. The learning algorithm will pick the y-intercept which minimises the error function at the fitted slope.

      Intuitively, in the first case, the ability of the line to fit the data is critically limited, whereas in the second case, the bias term can shift the line closer to the points to reduce the error.

      ** I “think” this should be the reason from my intuition. Further investigation might be necessary.
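The two cases above can be checked numerically. The following is a minimal sketch, assuming a squared-error loss with an L2 penalty; the data set, the penalty strength `lam`, and the helper `fit_line` are made up for illustration and do not come from the lecture or the textbook.

```python
# Toy data on the line y = 2x + 10 (hypothetical; note the large intercept).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [10.0, 12.0, 14.0, 16.0, 18.0]
lam = 10.0  # penalty strength, chosen arbitrarily for the illustration


def fit_line(xs, ys, lam, penalize_bias):
    """Minimize sum((y - w1*x - w0)^2) + lam*w1^2 (+ lam*w0^2 if penalize_bias).

    Solves the 2x2 normal equations with Cramer's rule and returns (w1, w0).
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a11 = sxx + lam                               # slope is always penalized
    a12 = sx
    a22 = n + (lam if penalize_bias else 0.0)     # bias only in case 1
    det = a11 * a22 - a12 * a12
    w1 = (sxy * a22 - a12 * sy) / det
    w0 = (a11 * sy - a12 * sxy) / det
    return w1, w0


def mse(xs, ys, w1, w0):
    """Mean squared error of the line on the data."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys)) / len(xs)


both = fit_line(xs, ys, lam, penalize_bias=True)         # case 1
slope_only = fit_line(xs, ys, lam, penalize_bias=False)  # case 2
mse_both = mse(xs, ys, *both)
mse_slope_only = mse(xs, ys, *slope_only)

print("case 1 (w1, w0):", both, "training MSE:", mse_both)
print("case 2 (w1, w0):", slope_only, "training MSE:", mse_slope_only)
```

On this toy data, penalizing the intercept drags it from around 12 down to about 2.4, and the training error grows by more than a factor of ten. The slope is penalized in both cases, so the extra error comes from the intercept penalty alone.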


    • Regularizing the bias penalizes the bias and, in the case of L1, can push it to 0. Zero bias means underfitting, since the role of the bias is to position the decision boundary in the right place.


      • I am not sure I fully understand your explanation. Would that be the case only if a strong emphasis is placed on the regularization term? With your explanation, I feel you could say the same thing about the set of weights. Given the wrong emphasis on regularization, you could force them to 0 as well, and therefore end up with a poorly performing function of these weights.

        I do agree, though, that the goal of regularization is to limit the model’s capacity and that too much of it can have a negative impact on the performance of the model.


  2. When early stopping is presented, in either the slides or the textbook, two algorithms are shown that use the validation set for training afterwards. Since it was not mentioned in class or in the textbook, I wonder whether k-fold cross-validation could be a good alternative for these algorithms to avoid “wasting potentially good training data”. The way I see it, the reason some people are tempted to train on the validation set is that they fear it contains data that would allow the model to generalize better.

    K-fold cross-validation would include every training and validation point in the training of (k-1) of the k models. So if a particular data point helps pinpoint the discriminative features of a class (for instance, having a bat in the training data would remove “not flying” from the discriminative features of “mammals”), k-1 models would potentially perform better on the validation set than the one model for which that point was not part of training. In this case, the model to choose among the k to minimize the generalization error would logically be the one with minimal validation error. This would give, I think, better generalization than setting the validation set arbitrarily or training on it afterwards without knowing whether the number of epochs/iterations would still be appropriate on the extended training set, or would lead to over- or underfitting.

    What do you guys think?


    • I’m not sure if we covered it in class, but what you propose looks a lot like bagging ( https://en.wikipedia.org/wiki/Bootstrap_aggregating ), where we split the data into k parts (with or without replacement), train the models, and then take the average. Doing so reduces the variance of the models and helps to prevent overfitting.

      Now, considering your method, we have a problem with model selection. If we only keep one model (as you proposed), which one do we keep? We can’t really use the one with the best results on the validation set, because here each model has a different validation set. We would then need a validation-validation set or something. (And this, too, raises questions. What data do I use for this set? Do I have enough data to allow me to have two validation sets? etc.)


    • I think that the problem with using cross-validation in the deep learning context is the huge training time required in most applications. Such long training times could be impractical. Strategies such as dropout and early stopping seem to regularize well without the extra cost of k-1 retrainings.
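One way to combine the ideas in this thread is sketched below. It does not pick a single fold's model, since each fold is scored on a different validation set; it uses the k folds only to estimate how many epochs early stopping would choose, then retrains on all the data for that many epochs. This is a sketch under stated assumptions and not the procedure from the slides or the textbook: the toy data, the one-variable linear model, the learning rate, the patience, and every function name here are hypothetical.

```python
import random


def make_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def mse(data, w1, w0):
    """Mean squared error of the line y = w1*x + w0 on (x, y) pairs."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in data) / len(data)


def gd_epoch(data, w1, w0, lr):
    """One full-batch gradient descent step on the mean squared error."""
    n = len(data)
    g1 = sum(-2 * x * (y - (w1 * x + w0)) for x, y in data) / n
    g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in data) / n
    return w1 - lr * g1, w0 - lr * g0


def best_epoch(train, valid, lr=0.05, max_epochs=500, patience=20):
    """Early stopping: return (epoch, error) with the lowest validation error."""
    w1 = w0 = 0.0
    best_err, best_ep = float("inf"), 0
    for ep in range(1, max_epochs + 1):
        w1, w0 = gd_epoch(train, w1, w0, lr)
        err = mse(valid, w1, w0)
        if err < best_err:
            best_err, best_ep = err, ep
        elif ep - best_ep >= patience:
            break  # no improvement for `patience` epochs
    return best_ep, best_err


# Hypothetical noisy data around y = 2x + 1.
rng = random.Random(0)
data = []
for _ in range(40):
    x = rng.random()
    data.append((x, 2 * x + 1 + rng.gauss(0, 0.3)))

k = 5
folds = make_folds(len(data), k)
results = []
for held_out in folds:
    valid = [data[j] for j in held_out]
    train = [data[j] for f in folds if f is not held_out for j in f]
    results.append(best_epoch(train, valid))

# Average the stopping epochs across folds, then retrain on everything.
n_epochs = max(1, round(sum(ep for ep, _ in results) / k))
w1 = w0 = 0.0
for _ in range(n_epochs):
    w1, w0 = gd_epoch(data, w1, w0, lr=0.05)

print("stopping epoch per fold:", [ep for ep, _ in results])
print("epochs used for final fit:", n_epochs)
```

The cost raised in the last reply shows up directly: the loop trains k models before the final fit, which is cheap here but not for a large network.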


  3. Hi,

    I remember that we looked at a paper in class for the demonstration showing the equivalence between the arithmetic mean and the geometric mean, but this article is not provided here. So my question is: are we responsible for this material? If yes, could you (Aaron) provide a link to the paper?


    • We didn’t see the *equivalence* of the arithmetic mean to the geometric mean. What we saw in class was the development showing that dropout in a single layer model corresponds to the geometric mean. I think this is what you are referring to and it’s in the Deep Learning textbook sec. 7.12 (which was assigned and linked as part of chap. 7).



      • The only proof that I can find in the textbook is the weight scaling inference rule on pages 256-257. There is no development showing that dropout in a single-layer model corresponds to the geometric mean.
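The two replies may be describing the same result: for a softmax regression model, the weight scaling rule gives exactly the renormalized geometric mean of the predictions over all dropout masks. The sketch below checks this numerically under stated assumptions: dropout is applied to the inputs with inclusion probability 1/2, and the weights, biases, and input are random values made up for the test. It is an illustration, not a reproduction of the textbook's derivation.

```python
import itertools
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x):
    """Logits b[k] + sum_i W[i][k] * x[i] for each class k."""
    return [b[k] + sum(W[i][k] * x[i] for i in range(len(x)))
            for k in range(len(b))]


rng = random.Random(0)
n_in, n_cls = 4, 3  # small enough to enumerate all 2**4 masks
W = [[rng.uniform(-1, 1) for _ in range(n_cls)] for _ in range(n_in)]
b = [rng.uniform(-1, 1) for _ in range(n_cls)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

masks = list(itertools.product([0, 1], repeat=n_in))

# Accumulate log-probabilities (geometric mean) and probabilities
# (arithmetic mean) over every sub-model defined by a binary input mask.
log_sum = [0.0] * n_cls
prob_sum = [0.0] * n_cls
for d in masks:
    p = softmax(logits(W, b, [di * xi for di, xi in zip(d, x)]))
    for k in range(n_cls):
        log_sum[k] += math.log(p[k])
        prob_sum[k] += p[k]

geo = [math.exp(v / len(masks)) for v in log_sum]
geo_mean = [g / sum(geo) for g in geo]            # renormalized geometric mean
arith_mean = [v / len(masks) for v in prob_sum]   # ordinary ensemble average

# Weight scaling rule: one forward pass with the inputs scaled by 1/2.
scaled = softmax(logits(W, b, [0.5 * xi for xi in x]))

print("geometric mean :", geo_mean)
print("weight scaling :", scaled)
print("arithmetic mean:", arith_mean)
```

The first two printed vectors agree to floating-point precision. The arithmetic mean is shown only for contrast; in general it is a different distribution, which matches the earlier reply that no equivalence between the two means was shown.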


  4. In this lecture we discussed, among other things, the use of a validation set to better choose hyper-parameters, as well as early stopping.

    I asked the following: Is it pertinent to use Statistical Design of Experiments, or similar techniques, instead of random search, to “guide” the search for hyper-parameters?

    Aaron said that this is in fact done, and mentioned that it is an active area of research. The art of initialization and very careful tuning can be the difference between one level of performance and another, much more so than innovation in optimization techniques.

    I searched online and found this interesting paper: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

    The paper describes guiding the search toward regions of the hyper-parameter space that maximize the expected improvement. Personally, I think we should all really pay attention to what is being developed in this particular sub-problem of the field!

    Note: This comment is only to leave a trace on the blog of something that I asked in class. I frequently asked questions, but was too shy to come to the blog and post them.
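The expected-improvement criterion mentioned in the comment above can be sketched in a few lines. This assumes the surrogate model gives a Gaussian belief (mean and standard deviation) about the validation error at each candidate setting, and that lower error is better. The candidate names and the numbers are invented for illustration; in the paper the belief comes from a Gaussian process, which is not implemented here.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def expected_improvement(mu, sigma, best):
    """E[max(best - f, 0)] for f ~ N(mu, sigma^2), i.e. EI when minimizing."""
    if sigma <= 0:
        return max(best - mu, 0.0)  # no uncertainty left at this point
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical surrogate beliefs about validation error: (mean, std).
best_so_far = 0.27
candidates = {
    "lr=0.001": (0.30, 0.01),  # well explored, slightly worse than best
    "lr=0.01": (0.28, 0.02),   # close to the best, little uncertainty
    "lr=0.1": (0.35, 0.20),    # worse on average, but very uncertain
}

scores = {name: expected_improvement(mu, sigma, best_so_far)
          for name, (mu, sigma) in candidates.items()}
next_trial = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, "EI =", score)
print("next setting to try:", next_trial)
```

With these made-up numbers the criterion picks the most uncertain candidate even though its predicted error is the worst, which is how expected improvement trades off exploring unknown regions against refining known good ones.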

