# IFT6266 – H2017 Deep Learning

## A Graduate Course Offered at Université de Montréal

# 14/15 – Optimization


In this lecture, we will have a rather detailed discussion of optimization methods and their interpretation.

**Slides:**

**Reference:** (* = you are responsible for this material)

- *Chapter 8 of the Deep Learning textbook.

## 5 thoughts on “14/15 – Optimization”

Hello. The link for the slides is broken at the moment 🙂

Cheers.


Has anyone applied evolutionary algorithms (e.g. Genetic Algorithms, Differential Evolution, PSO, CMA-ES) to train DNNs? I can’t find much work in that direction. Is there any big issue in doing so? The high number of parameters?


Although I don’t know much about this topic myself, I believe what you are looking for is the field of neuroevolution. https://en.wikipedia.org/wiki/Neuroevolution

The issue with optimization using evolutionary algorithms is that it is much slower than backpropagation for large networks. For example, if a neural network has 100 000 weights, then the gradient effectively gives 100 000 hints (one per weight) to help adjust them. In contrast, a genetic algorithm only uses the final ‘fitness’ score of the whole network to assess its performance and tune the weights.

I would note that genetic algorithms may be useful for optimizing discrete quantities (like the number of hidden units), for which backpropagation cannot be used.
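To make the contrast concrete, here is a rough NumPy sketch (a toy linear model; all names and sizes are just illustrative). One gradient step gets a separate signal for every weight, while one evolutionary step only gets a single fitness number per candidate network.

```python
import numpy as np

# Toy "network": a single linear layer with a squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 100))        # 32 examples, 100 inputs
y = rng.normal(size=32)
w = np.zeros(100)                     # 100 weights to optimize

def loss(w):
    return np.mean((X @ w - y) ** 2)

# Backpropagation: the gradient gives one "hint" per weight.
grad = 2 * X.T @ (X @ w - y) / len(y)      # shape (100,): a direction for every weight
w_backprop = w - 0.01 * grad

# Evolutionary step: each candidate is judged only by its scalar fitness.
population = [w + 0.1 * rng.normal(size=100) for _ in range(50)]
fitness = [-loss(c) for c in population]           # one number per whole network
w_evolution = population[int(np.argmax(fitness))]  # keep the fittest candidate
```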


Thanks for the answer.

I just saw this paper on the subject and would like to share it:

https://arxiv.org/abs/1703.01041


Regarding Adam, I asked in class: why is there a bias in the moment estimators, and how is it corrected by the corresponding expressions? It was not obvious to me at the time just by looking at the pseudo-code.

Aaron pointed out that, taking the first iteration as an example, the estimate starts at

s = 0,

so the first update gives

s = (1 - rho_1) * h,

which is closer to zero than the gradient h. Dividing by (1 - rho_1^1) at this iteration corrects the estimate, so that it equals h again. The same applies to the second-moment estimate, and it can be proven for later iterations as well, of course.
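To make that argument concrete, here is a small numeric sketch (assuming, purely for illustration, that the gradient g stays constant across steps): at t = 1 the raw first-moment estimate is only (1 - rho_1) * g, but dividing by (1 - rho_1^t) recovers g exactly.

```python
import numpy as np

rho1, rho2 = 0.9, 0.999            # Adam's usual decay rates
g = np.array([0.5, -1.2])          # pretend the gradient is constant over steps

s = np.zeros_like(g)               # first-moment estimate, starts at 0
r = np.zeros_like(g)               # second-moment estimate, starts at 0

for t in range(1, 4):
    s = rho1 * s + (1 - rho1) * g          # biased toward 0 because s started at 0
    r = rho2 * r + (1 - rho2) * g ** 2
    s_hat = s / (1 - rho1 ** t)            # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)            # bias-corrected second moment
    print(t, s, s_hat)                     # at t = 1: s = 0.1 * g, but s_hat = g
```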

Note: this comment is just to leave a trace on the blog of something I asked in class. I frequently asked questions, but was too shy to come to the blog and post them.
