microsoft/AI-For-Beginners
Mirrored from https://github.com/microsoft/AI-For-Beginners
lessons/5-NLP/16-RNN/RNNPyTorch.ipynb
| 1 | { |
| 2 | "cells": [ |
| 3 | { |
| 4 | "cell_type": "markdown", |
| 5 | "metadata": {}, |
| 6 | "source": [ |
| 7 | "# Recurrent neural networks\n", |
| 8 | "\n", |
| 9 | "In the previous module, we used rich semantic representations of text with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take the **order** of words into account, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models cannot represent word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.\n", |
| 10 | "\n", |
| 11 | "To capture the meaning of a text sequence, we need another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one token at a time, and the network produces some **state**, which we then pass to the network again together with the next token.\n", |
| 12 | "\n", |
| 13 | "<img alt=\"RNN\" src=\"images/rnn.png\" width=\"60%\"/>\n", |
| 14 | "\n", |
| 15 | "Given the input sequence of tokens $X_0,\\dots,X_n$, an RNN creates a sequence of neural network blocks. Each block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as a result. The final state $S_n$ or the output $Y_n$ goes into a linear classifier to produce the result. All network blocks share the same weights and are trained end-to-end using one backpropagation pass.\n", |
| 16 | "\n", |
| 17 | "Because state vectors $S_0,\\dots,S_n$ are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, it can learn to negate certain elements within the state vector, resulting in negation. \n", |
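| | "\n", |
| | "A minimal sketch of this recurrence, assuming illustrative sizes (`embed_dim=8`, `hidden_dim=4`) and random vectors in place of real token embeddings. It applies a single `torch.nn.RNNCell` step by step, so the same weights are reused at every position:\n", |
| | "\n", |
| | "```python\n", |
| | "import torch\n", |
| | "\n", |
| | "torch.manual_seed(0)\n", |
| | "embed_dim, hidden_dim = 8, 4            # illustrative sizes (assumptions)\n", |
| | "cell = torch.nn.RNNCell(embed_dim, hidden_dim)\n", |
| | "\n", |
| | "x = torch.randn(5, embed_dim)           # stand-in for 5 embedded tokens X_0..X_4\n", |
| | "s = torch.zeros(1, hidden_dim)          # initial state S_0 (batch of size 1)\n", |
| | "for x_i in x:\n", |
| | "    # the same cell, with the same weights, maps (X_i, S_i) to S_{i+1}\n", |
| | "    s = cell(x_i.unsqueeze(0), s)\n", |
| | "print(s.shape)                          # torch.Size([1, 4])\n", |
| | "```\n", |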
| 18 | "\n", |
| 19 | "> Since weights of all RNN blocks on the picture are shared, the same picture can be represented as one block (on the right) with a recurrent feedback loop, which passes output state of the network back to the input.\n", |
| 20 | "\n", |
| 21 | "Let's see how recurrent neural networks can help us classify our news dataset." |
| 22 | ] |
| 23 | }, |
| 24 | { |
| 25 | "cell_type": "code", |
| 26 | "execution_count": 1, |
| 27 | "metadata": {}, |
| 28 | "outputs": [ |
| 29 | { |
| 30 | "name": "stdout", |
| 31 | "output_type": "stream", |
| 32 | "text": [ |
| 33 | "Loading dataset...\n", |
| 34 | "Building vocab...\n" |
| 35 | ] |
| 36 | } |
| 37 | ], |
| 38 | "source": [ |
| 39 | "import torch\n", |
| 40 | "import torchtext\n", |
| 41 | "from torchnlp import *\n", |
| 42 | "train_dataset, test_dataset, classes, vocab = load_dataset()\n", |
| 43 | "vocab_size = len(vocab)" |
| 44 | ] |
| 45 | }, |
| 46 | { |
| 47 | "cell_type": "markdown", |
| 48 | "metadata": {}, |
| 49 | "source": [ |
| 50 | "## Simple RNN classifier\n", |
| 51 | "\n", |
| 52 | "In a simple RNN, each recurrent unit is a simple linear layer followed by a non-linearity (`tanh` by default in PyTorch), which takes the concatenated input vector and state vector and produces a new state vector. PyTorch represents this unit with the `RNNCell` class, and a network of such cells with the `RNN` layer.\n", |
| 53 | "\n", |
| 54 | "To define an RNN classifier, we will first apply an embedding layer to lower the dimensionality of input vocabulary, and then have RNN layer on top of it: " |
| 55 | ] |
| 56 | }, |
| 57 | { |
| 58 | "cell_type": "code", |
| 59 | "execution_count": 2, |
| 60 | "metadata": {}, |
| 61 | "outputs": [], |
| 62 | "source": [ |
| 63 | "class RNNClassifier(torch.nn.Module):\n", |
| 64 | "    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):\n", |
| 65 | "        super().__init__()\n", |
| 66 | "        self.hidden_dim = hidden_dim\n", |
| 67 | "        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)\n", |
| 68 | "        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)\n", |
| 69 | "        self.fc = torch.nn.Linear(hidden_dim, num_class)\n", |
| 70 | "\n", |
| 71 | "    def forward(self, x):\n", |
| 72 | "        batch_size = x.size(0)\n", |
| 73 | "        x = self.embedding(x)  # (batch, seq_len) -> (batch, seq_len, embed_dim)\n", |
| 74 | "        x,h = self.rnn(x)  # x: outputs at every step, h: final hidden state\n", |
| 75 | "        return self.fc(x.mean(dim=1))  # average the outputs over time steps, then classify" |
| 76 | ] |
| 77 | }, |
| 78 | { |
| 79 | "cell_type": "markdown", |
| 80 | "metadata": {}, |
| 81 | "source": [ |
| 82 | "> **Note:** We use an untrained embedding layer here for simplicity, but for better results we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous unit. For better understanding, you might want to adapt this code to work with pre-trained embeddings.\n", |
| 83 | "\n", |
| 84 | "In our case, we will use a padded data loader, so each batch will contain a number of padded sequences of the same length. The RNN layer takes the sequence of embedding tensors and produces two outputs: \n", |
| 85 | "* $x$ is a sequence of RNN cell outputs at each step\n", |
| 86 | "* $h$ is a final hidden state for the last element of the sequence\n", |
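| | "\n", |
| | "The two outputs can be inspected on a toy batch. This is a minimal sketch with made-up sizes (3 sequences, 5 steps, 8 input features, 4 hidden units) and random data instead of real embeddings:\n", |
| | "\n", |
| | "```python\n", |
| | "import torch\n", |
| | "\n", |
| | "rnn = torch.nn.RNN(input_size=8, hidden_size=4, batch_first=True)\n", |
| | "batch = torch.randn(3, 5, 8)     # (batch, seq_len, features), illustrative sizes\n", |
| | "x, h = rnn(batch)\n", |
| | "print(x.shape)                   # torch.Size([3, 5, 4]): one output per step\n", |
| | "print(h.shape)                   # torch.Size([1, 3, 4]): final state per sequence\n", |
| | "# for a single-layer, one-directional RNN the last output equals the final state\n", |
| | "print(torch.allclose(x[:, -1], h[-1]))   # True\n", |
| | "```\n", |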
| 87 | "\n", |
| 88 | "In the code above, we average the outputs $x$ over all time steps and apply a fully-connected linear classifier to obtain a score for each class.\n", |
| 89 | "\n", |
| 90 | "> **Note:** RNNs are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the number of layers involved in backpropagation becomes quite large. Thus we need to select a small learning rate and train the network on a larger dataset to produce good results. This can take quite a long time, so using a GPU is preferred." |
| 91 | ] |
| 92 | }, |
| 93 | { |
| 94 | "cell_type": "code", |
| 95 | "execution_count": 3, |
| 96 | "metadata": { |
| 97 | "scrolled": true |
| 98 | }, |
| 99 | "outputs": [ |
| 100 | { |
| 101 | "name": "stdout", |
| 102 | "output_type": "stream", |
| 103 | "text": [ |
| 104 | "3200: acc=0.3090625\n", |
| 105 | "6400: acc=0.38921875\n", |
| 106 | "9600: acc=0.4590625\n", |
| 107 | "12800: acc=0.511953125\n", |
| 108 | "16000: acc=0.5506875\n", |
| 109 | "19200: acc=0.57921875\n", |
| 110 | "22400: acc=0.6070089285714285\n", |
| 111 | "25600: acc=0.6304296875\n", |
| 112 | "28800: acc=0.6484027777777778\n", |
| 113 | "32000: acc=0.66509375\n", |
| 114 | "35200: acc=0.6790056818181818\n", |
| 115 | "38400: acc=0.6929166666666666\n", |
| 116 | "41600: acc=0.7035817307692308\n", |
| 117 | "44800: acc=0.7137276785714286\n", |
| 118 | "48000: acc=0.72225\n", |
| 119 | "51200: acc=0.73001953125\n", |
| 120 | "54400: acc=0.7372794117647059\n", |
| 121 | "57600: acc=0.7436631944444444\n", |
| 122 | "60800: acc=0.7503947368421052\n", |
| 123 | "64000: acc=0.75634375\n", |
| 124 | "67200: acc=0.7615773809523809\n", |
| 125 | "70400: acc=0.7662642045454545\n", |
| 126 | "73600: acc=0.7708423913043478\n", |
| 127 | "76800: acc=0.7751822916666666\n", |
| 128 | "80000: acc=0.7790625\n", |
| 129 | "83200: acc=0.7825\n", |
| 130 | "86400: acc=0.7858564814814815\n", |
| 131 | "89600: acc=0.7890513392857142\n", |
| 132 | "92800: acc=0.7920474137931034\n", |
| 133 | "96000: acc=0.7952708333333334\n", |
| 134 | "99200: acc=0.7982258064516129\n", |
| 135 | "102400: acc=0.80099609375\n", |
| 136 | "105600: acc=0.8037594696969697\n", |
| 137 | "108800: acc=0.8060569852941176\n" |
| 138 | ] |
| 139 | } |
| 140 | ], |
| 141 | "source": [ |
| 142 | "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)\n", |
| 143 | "net = RNNClassifier(vocab_size,64,32,len(classes)).to(device)\n", |
| 144 | "train_epoch(net,train_loader, lr=0.001)" |
| 145 | ] |
| 146 | }, |
| 147 | { |
| 148 | "cell_type": "markdown", |
| 149 | "metadata": {}, |
| 150 | "source": [ |
| 151 | "## Long Short Term Memory (LSTM)\n", |
| 152 | "\n", |
| 153 | "One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one backpropagation pass, the error signal shrinks as it is propagated back through many steps and barely reaches the first layers of the network, so the network cannot learn relationships between distant tokens. One way to avoid this problem is to introduce **explicit state management** by using so-called **gates**. The two best-known architectures of this kind are **Long Short Term Memory** (LSTM) and **Gated Recurrent Unit** (GRU).\n", |
| 154 | "\n", |
| 155 | "\n", |
| 156 | "\n", |
| 157 | "An LSTM network is organized in a manner similar to an RNN, but two states are passed from layer to layer: the actual state $c$ and the hidden vector $h$. At each unit, the hidden vector $h_i$ is concatenated with the input $x_i$, and together they control what happens to the state $c$ via **gates**. Each gate is a neural network with sigmoid activation (output in the range $[0,1]$), which can be thought of as an element-wise mask when multiplied by the state vector. There are the following gates (from left to right on the picture above):\n", |
| 158 | "* **forget gate** takes the hidden vector and the input and determines which components of the vector $c$ we need to forget and which to pass through. \n", |
| 159 | "* **input gate** takes some information from the input and hidden vector and inserts it into the state.\n", |
| 160 | "* **output gate** passes the updated state $c_{i+1}$ through a $\\tanh$ activation, then selects some of its components using a mask computed from $h_i$ and $x_i$ to produce the new hidden vector $h_{i+1}$.\n", |
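| | "\n", |
| | "To make the role of the gates concrete, here is a sketch of one LSTM step written out by hand with the standard LSTM equations and compared against `torch.nn.LSTMCell`. The sizes and the random inputs are assumptions for illustration only; PyTorch stacks the gate weights in the order input, forget, candidate, output:\n", |
| | "\n", |
| | "```python\n", |
| | "import torch\n", |
| | "\n", |
| | "torch.manual_seed(0)\n", |
| | "n_in, n_hid = 8, 4                       # illustrative sizes (assumptions)\n", |
| | "lstm = torch.nn.LSTMCell(n_in, n_hid)\n", |
| | "x_i = torch.randn(1, n_in)\n", |
| | "h_i, c_i = torch.randn(1, n_hid), torch.randn(1, n_hid)\n", |
| | "\n", |
| | "# all four gates are computed from the input x_i and the hidden vector h_i\n", |
| | "gates = x_i @ lstm.weight_ih.T + lstm.bias_ih + h_i @ lstm.weight_hh.T + lstm.bias_hh\n", |
| | "i, f, g, o = gates.chunk(4, dim=1)\n", |
| | "i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # masks in [0,1]\n", |
| | "g = torch.tanh(g)                        # candidate values to insert\n", |
| | "\n", |
| | "c_next = f * c_i + i * g                 # forget part of the old state, insert new information\n", |
| | "h_next = o * torch.tanh(c_next)          # output gate selects what goes into the hidden vector\n", |
| | "\n", |
| | "h_ref, c_ref = lstm(x_i, (h_i, c_i))     # reference result from PyTorch\n", |
| | "print(torch.allclose(h_next, h_ref, atol=1e-5), torch.allclose(c_next, c_ref, atol=1e-5))\n", |
| | "```\n", |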
| 161 | "\n", |
| 162 | "Components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we may want to assume that it refers to a female character and raise the flag in the state indicating that we have a female noun in the sentence. When we further encounter the phrase *and Tom*, we raise the flag indicating that we have a plural noun. Thus, by manipulating the state, we can supposedly keep track of the grammatical properties of sentence parts.\n", |
| 163 | "\n", |
| 164 | "> **Note**: A great resource for understanding the internals of LSTM is the article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.\n", |
| 165 | "\n", |
| 166 | "While the internal structure of an LSTM cell may look complex, PyTorch hides this implementation inside the `LSTMCell` class and provides the `LSTM` object to represent the whole LSTM layer. Thus, the implementation of an LSTM classifier is very similar to the simple RNN we have seen above:" |
| 167 | ] |
| 168 | }, |
| 169 | { |
| 170 | "cell_type": "code", |
| 171 | "execution_count": 4, |
| 172 | "metadata": {}, |
| 173 | "outputs": [], |
| 174 | "source": [ |
| 175 | "class LSTMClassifier(torch.nn.Module):\n", |
| 176 | "    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):\n", |
| 177 | "        super().__init__()\n", |
| 178 | "        self.hidden_dim = hidden_dim\n", |
| 179 | "        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)\n", |
| 180 | "        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5\n", |
| 181 | "        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)\n", |
| 182 | "        self.fc = torch.nn.Linear(hidden_dim, num_class)\n", |
| 183 | "\n", |
| 184 | "    def forward(self, x):\n", |
| 185 | "        batch_size = x.size(0)\n", |
| 186 | "        x = self.embedding(x)\n", |
| 187 | "        x,(h,c) = self.rnn(x)\n", |
| 188 | "        return self.fc(h[-1])" |
| 189 | ] |
| 190 | }, |
| 191 | { |
| 192 | "cell_type": "markdown", |
| 193 | "metadata": {}, |
| 194 | "source": [ |
| 195 | "Now let's train our network. Note that training an LSTM is also quite slow, and you may not see much of a rise in accuracy at the beginning of training. Also, you may need to play with the `lr` learning rate parameter to find a learning rate that results in reasonable training speed and yet does not make training unstable." |
| 196 | ] |
| 197 | }, |
| 198 | { |
| 199 | "cell_type": "code", |
| 200 | "execution_count": 5, |
| 201 | "metadata": {}, |
| 202 | "outputs": [ |
| 203 | { |
| 204 | "name": "stdout", |
| 205 | "output_type": "stream", |
| 206 | "text": [ |
| 207 | "3200: acc=0.259375\n", |
| 208 | "6400: acc=0.25859375\n", |
| 209 | "9600: acc=0.26177083333333334\n", |
| 210 | "12800: acc=0.2784375\n", |
| 211 | "16000: acc=0.313\n", |
| 212 | "19200: acc=0.3528645833333333\n", |
| 213 | "22400: acc=0.3965625\n", |
| 214 | "25600: acc=0.4385546875\n", |
| 215 | "28800: acc=0.4752777777777778\n", |
| 216 | "32000: acc=0.505375\n", |
| 217 | "35200: acc=0.5326704545454546\n", |
| 218 | "38400: acc=0.5557552083333334\n", |
| 219 | "41600: acc=0.5760817307692307\n", |
| 220 | "44800: acc=0.5954910714285714\n", |
| 221 | "48000: acc=0.6118333333333333\n", |
| 222 | "51200: acc=0.62681640625\n", |
| 223 | "54400: acc=0.6404779411764706\n", |
| 224 | "57600: acc=0.6520138888888889\n", |
| 225 | "60800: acc=0.662828947368421\n", |
| 226 | "64000: acc=0.673546875\n", |
| 227 | "67200: acc=0.6831547619047619\n", |
| 228 | "70400: acc=0.6917897727272727\n", |
| 229 | "73600: acc=0.6997146739130434\n", |
| 230 | "76800: acc=0.707109375\n", |
| 231 | "80000: acc=0.714075\n", |
| 232 | "83200: acc=0.7209134615384616\n", |
| 233 | "86400: acc=0.727037037037037\n", |
| 234 | "89600: acc=0.7326674107142858\n", |
| 235 | "92800: acc=0.7379633620689655\n", |
| 236 | "96000: acc=0.7433645833333333\n", |
| 237 | "99200: acc=0.7479032258064516\n", |
| 238 | "102400: acc=0.752119140625\n", |
| 239 | "105600: acc=0.7562405303030303\n", |
| 240 | "108800: acc=0.76015625\n", |
| 241 | "112000: acc=0.7641339285714286\n", |
| 242 | "115200: acc=0.7677777777777778\n", |
| 243 | "118400: acc=0.7711233108108108\n" |
| 244 | ] |
| 245 | }, |
| 246 | { |
| 247 | "data": { |
| 248 | "text/plain": [ |
| 249 | "(0.03487814127604167, 0.7728)" |
| 250 | ] |
| 251 | }, |
| 252 | "execution_count": 5, |
| 253 | "metadata": {}, |
| 254 | "output_type": "execute_result" |
| 255 | } |
| 256 | ], |
| 257 | "source": [ |
| 258 | "net = LSTMClassifier(vocab_size,64,32,len(classes)).to(device)\n", |
| 259 | "train_epoch(net,train_loader, lr=0.001)" |
| 260 | ] |
| 261 | }, |
| 262 | { |
| 263 | "cell_type": "markdown", |
| 264 | "metadata": {}, |
| 265 | "source": [ |
| 266 | "## Packed sequences\n", |
| 267 | "\n", |
| 268 | "In our example, we had to pad all sequences in the minibatch with zeros. While this results in some memory waste, with RNNs it is more critical that additional RNN cells are created for the padded input items, which take part in training, yet do not carry any important input information. It would be much better to run the RNN only over the actual length of each sequence.\n", |
| 269 | "\n", |
| 270 | "To do that, a special format of padded sequence storage is introduced in PyTorch. Suppose we have input padded minibatch which looks like this:\n", |
| 271 | "```\n", |
| 272 | "[[1,2,3,4,5],\n", |
| 273 | " [6,7,8,0,0],\n", |
| 274 | " [9,0,0,0,0]]\n", |
| 275 | "```\n", |
| 276 | "Here 0 represents padded values, and the actual length vector of input sequences is `[5,3,1]`.\n", |
| 277 | "\n", |
| 278 | "To train an RNN efficiently on padded sequences, we want to start processing the first group of RNN cells with a large minibatch (`[1,6,9]`), then stop processing the third sequence and continue with shorter minibatches (`[2,7]`, `[3,8]`), and so on. Thus, a packed sequence is represented as one vector, in our case `[1,6,9,2,7,3,8,4,5]`, together with the length vector (`[5,3,1]`), from which we can easily reconstruct the original padded minibatch. (Internally, PyTorch's `PackedSequence` stores the equivalent per-step batch sizes, here `[3,2,2,1,1]`.)\n", |
| 279 | "\n", |
| 280 | "To produce packed sequence, we can use `torch.nn.utils.rnn.pack_padded_sequence` function. All recurrent layers, including RNN, LSTM and GRU, support packed sequences as input, and produce packed output, which can be decoded using `torch.nn.utils.rnn.pad_packed_sequence`.\n", |
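| | "\n", |
| | "As a small sketch, the padded minibatch from the example above can be packed and unpacked directly. Token ids are used here instead of embedding vectors purely for readability:\n", |
| | "\n", |
| | "```python\n", |
| | "import torch\n", |
| | "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n", |
| | "\n", |
| | "padded = torch.tensor([[1, 2, 3, 4, 5],\n", |
| | "                       [6, 7, 8, 0, 0],\n", |
| | "                       [9, 0, 0, 0, 0]])\n", |
| | "lengths = torch.tensor([5, 3, 1])            # actual lengths, kept on the CPU\n", |
| | "\n", |
| | "packed = pack_padded_sequence(padded, lengths, batch_first=True)\n", |
| | "print(packed.data)          # tensor([1, 6, 9, 2, 7, 3, 8, 4, 5])\n", |
| | "print(packed.batch_sizes)   # tensor([3, 2, 2, 1, 1])\n", |
| | "\n", |
| | "# unpacking restores the padded minibatch and the length vector\n", |
| | "restored, restored_lengths = pad_packed_sequence(packed, batch_first=True)\n", |
| | "print(torch.equal(restored, padded))         # True\n", |
| | "```\n", |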
| 281 | "\n", |
| 282 | "To be able to produce packed sequence, we need to pass length vector to the network, and thus we need a different function to prepare minibatches:" |
| 283 | ] |
| 284 | }, |
| 285 | { |
| 286 | "cell_type": "code", |
| 287 | "execution_count": 6, |
| 288 | "metadata": {}, |
| 289 | "outputs": [], |
| 290 | "source": [ |
| 291 | "def pad_length(b):\n", |
| 292 | "    # build vectorized sequence\n", |
| 293 | "    v = [encode(x[1]) for x in b]\n", |
| 294 | "    # compute max length of a sequence in this minibatch and length sequence itself\n", |
| 295 | "    len_seq = list(map(len,v))\n", |
| 296 | "    l = max(len_seq)\n", |
| 297 | "    return ( # tuple of three tensors - labels, padded features, length sequence\n", |
| 298 | "        torch.LongTensor([t[0]-1 for t in b]),\n", |
| 299 | "        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v]),\n", |
| 300 | "        torch.tensor(len_seq)\n", |
| 301 | "    )\n", |
| 302 | "\n", |
| 303 | "train_loader_len = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)" |
| 304 | ] |
| 305 | }, |
| 306 | { |
| 307 | "cell_type": "markdown", |
| 308 | "metadata": {}, |
| 309 | "source": [ |
| 310 | "The actual network is very similar to `LSTMClassifier` above, but the `forward` pass receives both the padded minibatch and the vector of sequence lengths. After computing the embedding, we compute the packed sequence, pass it to the LSTM layer, and then unpack the result.\n", |
| 311 | "\n", |
| 312 | "> **Note**: We actually do not use unpacked result `x`, because we use output from the hidden layers in the following computations. Thus, we can remove the unpacking altogether from this code. The reason we place it here is for you to be able to modify this code easily, in case you should need to use network output in further computations." |
| 313 | ] |
| 314 | }, |
| 315 | { |
| 316 | "cell_type": "code", |
| 317 | "execution_count": 7, |
| 318 | "metadata": {}, |
| 319 | "outputs": [], |
| 320 | "source": [ |
| 321 | "class LSTMPackClassifier(torch.nn.Module):\n", |
| 322 | "    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):\n", |
| 323 | "        super().__init__()\n", |
| 324 | "        self.hidden_dim = hidden_dim\n", |
| 325 | "        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)\n", |
| 326 | "        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5\n", |
| 327 | "        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)\n", |
| 328 | "        self.fc = torch.nn.Linear(hidden_dim, num_class)\n", |
| 329 | "\n", |
| 330 | "    def forward(self, x, lengths):\n", |
| 331 | "        batch_size = x.size(0)\n", |
| 332 | "        x = self.embedding(x)\n", |
| 333 | "        packed_x = torch.nn.utils.rnn.pack_padded_sequence(x,lengths,batch_first=True,enforce_sorted=False)\n", |
| 334 | "        packed_x,(h,c) = self.rnn(packed_x)\n", |
| 335 | "        x, _ = torch.nn.utils.rnn.pad_packed_sequence(packed_x,batch_first=True)\n", |
| 336 | "        return self.fc(h[-1])" |
| 337 | ] |
| 338 | }, |
| 339 | { |
| 340 | "cell_type": "markdown", |
| 341 | "metadata": {}, |
| 342 | "source": [ |
| 343 | "Now let's do the training:" |
| 344 | ] |
| 345 | }, |
| 346 | { |
| 347 | "cell_type": "code", |
| 348 | "execution_count": 8, |
| 349 | "metadata": { |
| 350 | "scrolled": true |
| 351 | }, |
| 352 | "outputs": [ |
| 353 | { |
| 354 | "name": "stdout", |
| 355 | "output_type": "stream", |
| 356 | "text": [ |
| 357 | "3200: acc=0.285625\n", |
| 358 | "6400: acc=0.33359375\n", |
| 359 | "9600: acc=0.3876041666666667\n", |
| 360 | "12800: acc=0.44078125\n", |
| 361 | "16000: acc=0.4825\n", |
| 362 | "19200: acc=0.5235416666666667\n", |
| 363 | "22400: acc=0.5559821428571429\n", |
| 364 | "25600: acc=0.58609375\n", |
| 365 | "28800: acc=0.6116666666666667\n", |
| 366 | "32000: acc=0.63340625\n", |
| 367 | "35200: acc=0.6525284090909091\n", |
| 368 | "38400: acc=0.668515625\n", |
| 369 | "41600: acc=0.6822596153846154\n", |
| 370 | "44800: acc=0.6948214285714286\n", |
| 371 | "48000: acc=0.7052708333333333\n", |
| 372 | "51200: acc=0.71521484375\n", |
| 373 | "54400: acc=0.7239889705882353\n", |
| 374 | "57600: acc=0.7315277777777778\n", |
| 375 | "60800: acc=0.7388486842105263\n", |
| 376 | "64000: acc=0.74571875\n", |
| 377 | "67200: acc=0.7518303571428572\n", |
| 378 | "70400: acc=0.7576988636363636\n", |
| 379 | "73600: acc=0.7628940217391305\n", |
| 380 | "76800: acc=0.7681510416666667\n", |
| 381 | "80000: acc=0.7728125\n", |
| 382 | "83200: acc=0.7772235576923077\n", |
| 383 | "86400: acc=0.7815393518518519\n", |
| 384 | "89600: acc=0.7857700892857142\n", |
| 385 | "92800: acc=0.7895043103448276\n", |
| 386 | "96000: acc=0.7930520833333333\n", |
| 387 | "99200: acc=0.7959072580645161\n", |
| 388 | "102400: acc=0.798994140625\n", |
| 389 | "105600: acc=0.802064393939394\n", |
| 390 | "108800: acc=0.8051378676470589\n", |
| 391 | "112000: acc=0.8077857142857143\n", |
| 392 | "115200: acc=0.8104600694444445\n", |
| 393 | "118400: acc=0.8128293918918919\n" |
| 394 | ] |
| 395 | }, |
| 396 | { |
| 397 | "data": { |
| 398 | "text/plain": [ |
| 399 | "(0.029785829671223958, 0.8138166666666666)" |
| 400 | ] |
| 401 | }, |
| 402 | "execution_count": 8, |
| 403 | "metadata": {}, |
| 404 | "output_type": "execute_result" |
| 405 | } |
| 406 | ], |
| 407 | "source": [ |
| 408 | "net = LSTMPackClassifier(vocab_size,64,32,len(classes)).to(device)\n", |
| 409 | "train_epoch_emb(net,train_loader_len, lr=0.001,use_pack_sequence=True)\n" |
| 410 | ] |
| 411 | }, |
| 412 | { |
| 413 | "cell_type": "markdown", |
| 414 | "metadata": {}, |
| 415 | "source": [ |
| 416 | "> **Note:** You may have noticed the parameter `use_pack_sequence` that we pass to the training function. Currently, the `pack_padded_sequence` function requires the length sequence tensor to be on the CPU device, and thus the training function needs to avoid moving the length sequence data to the GPU when training. You can look into the implementation of the `train_epoch_emb` function in the [`torchnlp.py`](torchnlp.py) file." |
| 417 | ] |
| 418 | }, |
| 419 | { |
| 420 | "cell_type": "markdown", |
| 421 | "metadata": {}, |
| 422 | "source": [ |
| 423 | "## Bidirectional and multilayer RNNs\n", |
| 424 | "\n", |
| 425 | "In our examples, all recurrent networks operated in one direction, from the beginning of a sequence to the end. This looks natural, because it resembles the way we read and listen to speech. However, since in many practical cases we have random access to the input sequence, it might make sense to run the recurrent computation in both directions. Such networks are called **bidirectional** RNNs, and they can be created by passing the `bidirectional=True` parameter to the RNN/LSTM/GRU constructor.\n", |
| 426 | "\n", |
| 427 | "When dealing with a bidirectional network, we need two hidden state vectors, one for each direction. PyTorch concatenates the two directions in the per-step output, so each output vector is twice the hidden size, while the returned hidden state tensor contains a separate entry for each direction along its first dimension. If you pass the result to a fully-connected linear layer, you need to take this increase in size into account when creating the layer.\n", |
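| | "\n", |
| | "A short sketch of the shapes involved, assuming made-up sizes (3 sequences, 5 steps, 8 input features, 4 hidden units, 4 classes) and random input. The final states of the two directions are concatenated before the linear layer, which therefore takes `2 * hidden_size` inputs:\n", |
| | "\n", |
| | "```python\n", |
| | "import torch\n", |
| | "\n", |
| | "lstm = torch.nn.LSTM(input_size=8, hidden_size=4, bidirectional=True, batch_first=True)\n", |
| | "batch = torch.randn(3, 5, 8)         # (batch, seq_len, features), illustrative sizes\n", |
| | "x, (h, c) = lstm(batch)\n", |
| | "print(x.shape)                       # torch.Size([3, 5, 8]): both directions concatenated\n", |
| | "print(h.shape)                       # torch.Size([2, 3, 4]): one final state per direction\n", |
| | "\n", |
| | "# h[-2] is the final forward state, h[-1] the final backward state\n", |
| | "features = torch.cat([h[-2], h[-1]], dim=1)\n", |
| | "fc = torch.nn.Linear(2 * 4, 4)       # classifier input is twice the hidden size\n", |
| | "print(fc(features).shape)            # torch.Size([3, 4])\n", |
| | "```\n", |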
| 428 | "\n", |
| 429 | "A recurrent network, one-directional or bidirectional, captures certain patterns within a sequence and can store them in the state vector or pass them to the output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns, built from the low-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.\n", |
| 430 | "\n", |
| 431 | "\n", |
| 432 | "\n", |
| 433 | "*Picture from [this wonderful post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López*\n", |
| 434 | "\n", |
| 435 | "PyTorch makes constructing such networks an easy task, because you just need to pass the `num_layers` parameter to the RNN/LSTM/GRU constructor to build several layers of recurrence automatically. The first dimension of the returned hidden/state tensors then grows proportionally (one entry per layer and direction), while the output tensor contains only the outputs of the last layer, and you need to take this into account when handling the output of recurrent layers." |
| 436 | ] |
| 437 | }, |
| 438 | { |
| 439 | "cell_type": "markdown", |
| 440 | "metadata": {}, |
| 441 | "source": [ |
| 442 | "## RNNs for other tasks\n", |
| 443 | "\n", |
| 444 | "In this unit, we have seen that RNNs can be used for sequence classification, but in fact, they can handle many more tasks, such as text generation, machine translation, and more. We will consider those tasks in the next unit." |
| 445 | ] |
| 446 | } |
| 447 | ], |
| 448 | "metadata": { |
| 449 | "interpreter": { |
| 450 | "hash": "16af2a8bbb083ea23e5e41c7f5787656b2ce26968575d8763f2c4b17f9cd711f" |
| 451 | }, |
| 452 | "kernelspec": { |
| 453 | "display_name": "Python 3.8.12 ('py38')", |
| 454 | "language": "python", |
| 455 | "name": "python3" |
| 456 | }, |
| 457 | "language_info": { |
| 458 | "codemirror_mode": { |
| 459 | "name": "ipython", |
| 460 | "version": 3 |
| 461 | }, |
| 462 | "file_extension": ".py", |
| 463 | "mimetype": "text/x-python", |
| 464 | "name": "python", |
| 465 | "nbconvert_exporter": "python", |
| 466 | "pygments_lexer": "ipython3", |
| 467 | "version": "3.8.12" |
| 468 | } |
| 469 | }, |
| 470 | "nbformat": 4, |
| 471 | "nbformat_minor": 2 |
| 472 | } |
| 473 | |