In this lecture we introduce Recurrent Neural Networks.
Lecture 08 RNNs (slides from Hugo Larochelle)
Reference: (* = you are responsible for this material)
- *Chapter 10 of the Deep Learning textbook (sections. 10.1-10.11, we will cover the material in 10.12 later).
- Blog post on Understanding LSTM Networks by Chris Olah.
Advertisements
What was the tool that was mentioned in class for us to use in order to submit questions?
LikeLiked by 1 person
overleaf: https://www.overleaf.com
Share it with myself and both TAs.
— Aaron
LikeLiked by 1 person
In DeepLearningBook (10.2.1 – 10.2.2), two different recurrent nets are compared:
1 – With hidden-to-hidden connections
2 – Only with output-to-hidden connections (no hidden-to-hidden connections)
It says that computing gradient in the second case is easier to compute, because “there is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output”.
I understand that if computation can be parallelized, then it needs less computation time. But I don’t understand what big difference these two architectures have (so that computing gradient become independent from the previous states).
LikeLike
Oh! Seems like it is explained after in the book (teacher forcing). Sorry!
LikeLike
Refer figure 10.4 in the text. Lets say we are considering the state at time-step t+1. The hidden layer h(t+1) has only two inputs x(t+1) and o(t). We already have the ideal value of o(t) from the training set. If we provide the true output o(t)_true as the input to the state at t+1 instead of the predicted o(t), the network will become decoupled and the gradients can be computed in isolation
LikeLike
Figure 10.6 (Train time) is more suitable for the explanation of @sebyjacob.
LikeLike
In addition to potential computation gains, I think the fact that having a ‘ground truth’ such as the training data reduces the bias in the gradients that would be used in your parameter’s update.
LikeLike
What is the difference between GRU and LSTM?
1.A GRU has two gates (reset gate r, and an update gate z), an LSTM has three gates, so we have more parameter in LSTM. This unit that is missing from the GRU is the controlled exposure of the memory content( controlled by the output gate in LSTM) but GRU exposes its full content without any control.
2. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the
GRU controls the information flow from the previous activation when computing the new, candidate
activation( The activation in GRU is a linear interpolation between the previous activation and the candidate activation), but does not independently control the amount of the candidate activation being added
For detail description you can explore this Research Paper – https://arxiv.org/pdf/1412.3555v1.pdf The paper explains all this brilliantly.
LikeLike
I was wondering if someone had experienced with teacher forcing to answer my question. Since at test time we cannot use the true output as an input for our sequence being predicted, I was wondering if using the true data during training could actually hurt the model’s ability to generalize?
Isn’t the model expecting to receive a ground truth and then performs worst by not having it compared to being trained only using its own predictions?
LikeLike
When using the RNN for prediction, the ground-truth sequence is not available conditioning and
we sample from the joint distribution over the sequence by sampling each y_t
from its conditional distribution given the previously generated samples. Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can
lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously
generated samples) diverge from sequences seen during training.
There are couple of papers which address this issue.
Scheduled Sampling, Professor Forcing – A New Algorithm for Training Recurrent Networks
LikeLike