14/15 – Optimization

February 20, 2017 / February 21, 2017 · aaroncourville · Lectures

In this lecture, we will have a rather detailed discussion of optimization methods and their interpretation.

Slides:

Reference: (* = you are responsible for this material)

*Chapter 8 of the Deep Learning textbook.

5 thoughts on “14/15 – Optimization”

Philippe Paradis says:

February 20, 2017 at 9:19 am

Hello. The link for the slides is broken at the moment 🙂

Cheers.

Joao Monteiro says:

February 21, 2017 at 11:53 am

Has anyone applied evolutionary algorithms (e.g. Genetic Algorithms, Differential Evolution, PSO, CMA-ES) to train DNNs? I can’t find many works in that direction. Is there any big issue in doing so? High number of parameters?

- Wes Chung says:
  
  February 22, 2017 at 7:54 pm
  
  Although I don’t know much about this topic myself, I believe what you are looking for is the field of neuroevolution. https://en.wikipedia.org/wiki/Neuroevolution
  
  The issue with optimization using evolutionary algorithms is that it is much slower than backpropagation for large networks. For example, if a neural network has 100 000 weights, then the gradient effectively gives 100 000 hints (one per weight) to help adjust them. In contrast, a genetic algorithm only uses the final ‘fitness’ score of the whole network to assess its performance and tune the weights.
  
  I would note that genetic algorithms may be useful to optimize discrete quantities (like the number of hidden units) which backpropagation cannot be used for.
  
  - Joao Monteiro says:
    
    March 7, 2017 at 8:04 pm
    
    Thanks for the answer.
    
    I just saw this paper on the subject and would like to share it:
    
    https://arxiv.org/abs/1703.01041
    
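The point made in the thread above, that a gradient gives one hint per weight while an evolutionary method only sees a single fitness score, can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the toy quadratic loss, the step sizes, and the function names are hypothetical, and the evolutionary method is a simple (1+1) evolution strategy standing in for the genetic algorithms mentioned in the thread, not any specific method from the linked paper.

```python
import random

def loss(w, target):
    # Toy quadratic loss (an assumption): sum of squared errors per weight.
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def gradient_descent(w, target, steps, lr=0.1):
    for _ in range(steps):
        # The gradient 2 * (w - target) has one component per weight,
        # so every weight receives its own correction at every step.
        w = [wi - lr * 2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return w

def one_plus_one_es(w, target, steps, sigma=0.05, seed=0):
    # (1+1) evolution strategy: perturb all weights at random and keep
    # the candidate only if the single scalar fitness improves.
    rng = random.Random(seed)
    best = loss(w, target)
    for _ in range(steps):
        candidate = [wi + rng.gauss(0.0, sigma) for wi in w]
        score = loss(candidate, target)
        if score < best:
            w, best = candidate, score
    return w

n = 100                      # number of weights in the toy problem
target = [1.0] * n
start = [0.0] * n
start_loss = loss(start, target)
gd_loss = loss(gradient_descent(start, target, steps=200), target)
es_loss = loss(one_plus_one_es(start, target, steps=200), target)
print("start:", start_loss, "gradient descent:", gd_loss, "(1+1)-ES:", es_loss)
```

On this toy problem both methods get the same number of update steps, but gradient descent closes in on the target far faster. Real neuroevolution methods use populations and smarter mutation schemes, so this only hints at the scaling issue.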
Orestes Gonzalo Manzanilla Salazar says:

May 2, 2017 at 2:30 am

Regarding Adam, I asked in class: why are the moment estimators biased, and how do the correction expressions that follow them in the pseudo-code remove that bias? It was not evident to me at the time, just from looking at the pseudo-code.

Aaron pointed out that, taking the first iteration as an example, the estimate is initialized as

s = 0,

so the first update gives:

s = rho_1*0 + (1-rho_1)*h = (1-rho_1)*h,

which is smaller in magnitude than h, i.e. biased toward the initial value of zero. In this iteration, dividing by (1-rho_1^1) corrects the estimate, so that s = h. The same applies to the second-moment estimate. A similar argument can be made for later iterations, with (1-rho_1^t) in place of (1-rho_1^1).

Note: This comment is only meant to leave a trace on the blog of something that I asked in class. I frequently asked questions in class, but was too shy to come to the blog and post them.
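The argument in the comment above can be checked numerically with a minimal sketch. It assumes a constant value h at every step so the correct answer is known; the function name and the choice rho = 0.9 are hypothetical, and this tracks only a single scalar estimate, not a full Adam optimizer.

```python
def moving_average_with_correction(values, rho=0.9):
    """Return (raw, corrected) estimates after each step."""
    s = 0.0  # the estimate starts at zero, as in the comment
    raw, corrected = [], []
    for t, h in enumerate(values, start=1):
        s = rho * s + (1.0 - rho) * h            # exponential moving average
        raw.append(s)
        corrected.append(s / (1.0 - rho ** t))   # divide by (1 - rho^t)
    return raw, corrected

h = 2.0
raw, corrected = moving_average_with_correction([h] * 5, rho=0.9)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(t, r, c)
```

With a constant input, the raw estimate after step t equals (1 - rho^t) * h, so it starts near (1 - rho) * h and only slowly approaches h, while the corrected estimate equals h at every step (up to floating-point rounding). Feeding in h*h with rho_2 instead gives the same picture for the second-moment estimate.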
