Q4 – Capacité vs early-stopping

De Stéphanie Larocque

Pourquoi la capacité d’un modèle augmente lorsque le nombre d’itérations augmente? Expliquez l’intuition derrière le lien entre l’early-stopping et le $L_2$ weight decay pour un modèle linéaire.

5 thoughts on “Q4 – Capacité vs early-stopping”

1. A possible version of the answers for this question :

Q1 – Why the capacity of a model increases when the number of training iterations increases ?

A1 – Because the more a model will be trained, the more its parameters will be tuned to try to fit perfectly to the training set distribution. In other words, it corresponds to a progressive increase of the model’s capacity.

It is then necessary to control the number of training iterations, because we can reach a region of parameters that will lead to overfitting (low bias and high variance). In a situation of overfitting, the model’s capacity is too high. The training error is then very low whereas it has a high generalization error.

Q2 – Explain the intuition behind the link between early-stopping and L2 regularization (weight decay) for a linear model

A2 – Without mathematical proof, we can see early stopping as a method which restricts the optimization procedure to a small region of parameters, in the neighborhood of the initial parameter value. In other word, it is quite similar to a regularization method, which also restrict the parameters values by encoding prior knowledge into the model.

It can be shown that, by approximating the loss function with a quadratic approximation in the neighborhood of the empirical optimum, and by updating the parameters with the gradient descent method, the optimum found with early stopping is the same as the one found with L2 regularization (under certain conditions).

We can then conclude that, with a linear model using a quadratic error function and a simple gradient descent, early stopping is equivalent to L2 regularization.

Like

2. Stéphanie Larocque says:

My own answer was very similar :

A1 : Let’s suppose the gradient is bounded by \$L\$ (to simplify), the gradient step is \$s\$ and let \$w0\$ be the initial parameters value. With \$k\$ steps, we can reach any set of parameters that are at most \$s*L*k\$ distance from the initial value \$w0\$. The parameter space it can reach is then defined by a sphere of radius s*L*k centered at w0.

Therefore, if k1 > k2, then the model can reach more parameters with k1 steps and then have more capacity.

A2: L2 weight decay is often center at 0 : penalty(w) = ||w||. However, we can “center” it at other points. Instead of 0, we can center it at the initial value w0. The penalty would become penalty(w) = ||w-w0||. The model would then be penalized if the parameters go too far from w0.

Since the number of steps does not only penalize from going too far from w0 but does not allow it, we could slightly modify the penalty to prevent the parameters from going outside the sphere of radius s*L*k centered at w0:

penalty'(w)= lambda* max{ ||w-w0|| – s*L*k, 0 }

With a very big lambda, this will cause the parameters to *almost* never go outside that sphere, like early stopping does.

Like

• Lucas Caccia says:

Cool answer! Great way to provide an intuitive understanding of how L2 regularization and early stopping relate.

Like

3. My understanding is simply that as the number of iteration increases, the model can reach more possible values of parameters and therefore increasing the capacity of the model.

In the extreme case where no iterations are made, the model will have very low capacity (if not none) since the model will the one determined by the initial weights.

Like

4. Great responses!

I would also add that when using tanh-like output functions, since weights are generally initialized at low values, they start off operating in an approximately linear region.

Early stopping then prevents most weights from reaching values high enough to yield very non-linear responses, where capacity increases much more quickly then in a linear regime.

Thus, not only does early stopping then prevent from searching for parameters in too large a space, it also prevents from searching it the regions of parameter space that present the largest capacities!

Like