microsoft/AI-For-Beginners

Mirrored from https://github.com/microsoft/AI-For-Beginners

lessons/5-NLP/17-GenerativeNetworks/README.md

# Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs), provide a mechanism for language modeling: they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for **generative tasks**, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture we discussed in the previous unit, each RNN unit produced the next hidden state as its output. However, we can also add another output to each recurrent unit, which allows us to output a **sequence** (equal in length to the original sequence). Moreover, we can use RNN units that do not accept an input at each step, and instead take some initial state vector and produce a sequence of outputs.

This allows for the different neural architectures shown in the picture below:

![Image showing common recurrent neural network patterns.](images/unreasonable-effectiveness-of-rnn.jpg)

> Image from the blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by [Andrej Karpathy](http://karpathy.github.io/)

* **One-to-one** is a traditional neural network with one input and one output.
* **One-to-many** is a generative architecture that accepts one input value and generates a sequence of output values. For example, if we want to train an **image captioning** network that produces a textual description of a picture, we can take a picture as input, pass it through a CNN to obtain the hidden state, and then have a recurrent chain generate the caption word by word.
* **Many-to-one** corresponds to the RNN architectures we described in the previous unit, such as text classification.
* **Many-to-many**, or **sequence-to-sequence**, corresponds to tasks such as **machine translation**, where a first RNN collects all information from the input sequence into the hidden state, and another RNN chain unrolls this state into the output sequence.

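The recurrent patterns above differ mainly in which steps receive an input and which outputs are kept. The following is a minimal sketch of that idea, assuming a toy scalar cell with made-up, untrained weights (`W_X`, `W_H`, `W_Y` and all function names are hypothetical and chosen for illustration only). A real sequence-to-sequence model would use separate encoder and decoder networks. Here one cell is reused to keep the sketch short.

```python
import math

# Hypothetical fixed weights for a toy scalar RNN cell (not trained).
W_X, W_H, W_Y = 0.5, 0.8, 1.5

def cell(x, h):
    """One recurrent step: returns (output, next hidden state)."""
    h_next = math.tanh(W_X * x + W_H * h)
    y = W_Y * h_next
    return y, h_next

def many_to_one(xs, h=0.0):
    """Consume the whole sequence and keep only the last output."""
    y = None
    for x in xs:
        y, h = cell(x, h)
    return y

def many_to_many_aligned(xs, h=0.0):
    """Emit one output per input, so the result is as long as xs."""
    ys = []
    for x in xs:
        y, h = cell(x, h)
        ys.append(y)
    return ys

def one_to_many(h0, steps):
    """Unroll an initial state into a sequence; each step gets a dummy 0.0 input."""
    ys, h = [], h0
    for _ in range(steps):
        y, h = cell(0.0, h)
        ys.append(y)
    return ys

def encode(xs, h=0.0):
    """Collect the input sequence into a single hidden state."""
    for x in xs:
        _, h = cell(x, h)
    return h

def seq_to_seq(xs, out_len):
    """Encoder state is handed to a decoder that unrolls it into out_len outputs."""
    return one_to_many(encode(xs), out_len)

print(len(many_to_many_aligned([1.0, 2.0, 3.0])))  # 3
print(len(seq_to_seq([1.0, 2.0, 3.0], 5)))         # 5
```
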
In this unit, we will focus on simple generative models that help us generate text. For simplicity, we will use character-level tokenization.

We will train an RNN to generate text as follows. At each step, we take a sequence of characters of length `nchars` and ask the network to generate the next output character for each input character:

![Image showing an example RNN generation of the word 'HELLO'.](images/rnn-generate.png)

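One way to picture the training data is as pairs of strings in which the target is the input shifted by one character. The sketch below builds such pairs and a simple character-to-index mapping. The helper name `make_training_pairs` is hypothetical, and the notebooks may prepare their data differently.

```python
def make_training_pairs(text, nchars):
    """Slice text into (input, target) pairs of length nchars,
    where the target is the input shifted one character to the right."""
    pairs = []
    for i in range(len(text) - nchars):
        inp = text[i : i + nchars]
        tgt = text[i + 1 : i + nchars + 1]
        pairs.append((inp, tgt))
    return pairs

text = "HELLO"
pairs = make_training_pairs(text, nchars=4)
print(pairs)  # [('HELL', 'ELLO')]

# Character-level tokenization: every distinct character gets an integer index.
vocab = sorted(set(text))  # ['E', 'H', 'L', 'O']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
encoded = [
    ([char_to_idx[c] for c in inp], [char_to_idx[c] for c in tgt])
    for inp, tgt in pairs
]
print(encoded)  # [([1, 0, 2, 2], [0, 2, 2, 3])]
```
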
When generating text (during inference), we start with some **prompt**, which is passed through the RNN cells to produce an intermediate state; generation then starts from this state. We generate one character at a time, passing the state and the generated character to the next RNN cell to generate the following character, until we have generated enough characters.

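The inference procedure can be sketched as a two-phase loop: first feed the prompt to build up the state, then repeatedly pick a character and feed it back in. In the sketch below, `rnn_step` and `TABLE` are hypothetical stand-ins for a trained RNN cell: the "state" is just the last two characters, and the probabilities are invented. The next character is chosen greedily (highest probability) to keep the example deterministic.

```python
# Stand-in for a trained network (an assumption for illustration only):
# maps a state (the last one or two characters) to next-character probabilities.
TABLE = {
    "h":  {"e": 1.0},
    "he": {"l": 1.0},
    "el": {"l": 1.0},
    "ll": {"o": 0.8, "l": 0.2},
    "lo": {" ": 1.0},
    "o ": {"h": 1.0},
    " h": {"e": 1.0},
}

def rnn_step(ch, state):
    """Consume one character; return (next-char distribution, new state).
    A real cell would return an updated hidden vector instead of a string."""
    new_state = (state + ch)[-2:]
    return TABLE[new_state], new_state

def generate(prompt, length):
    """Generate `length` characters after a non-empty prompt."""
    state, probs = "", None
    # Phase 1: pass the prompt through the cell to obtain the intermediate state.
    for ch in prompt:
        probs, state = rnn_step(ch, state)
    out = prompt
    # Phase 2: pick a character, then feed it back together with the state.
    for _ in range(length):
        ch = max(probs, key=probs.get)  # greedy choice
        out += ch
        probs, state = rnn_step(ch, state)
    return out

print(generate("he", 8))  # hello hell
```
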
<img src="images/rnn-generate-inf.png" width="60%"/>

> Image by author

## Continue to Notebooks

* [Generative Networks with PyTorch](GenerativePyTorch.ipynb)
* [Generative Networks with TensorFlow](GenerativeTF.ipynb)

## Soft text generation and temperature

The output of each RNN cell is a probability distribution over characters. If we always take the character with the highest probability as the next character in the generated text, the text can often get stuck cycling through the same character sequences again and again, as in this example:

```
today of the second the company and a second the company ...
```

However, if we look at the probability distribution for the next character, the difference between the few highest probabilities may be small, e.g. one character can have probability 0.2 and another 0.19. For example, when looking for the next character after the sequence '*play*', the next character can equally well be a space or **e** (as in the word *player*).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest might still lead to meaningful text. It is wiser to **sample** characters from the probability distribution given by the network output. We can also use a parameter, **temperature**, which flattens the probability distribution when we want more randomness, or makes it steeper when we want to stick more closely to the highest-probability characters.

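As a rough illustration of temperature, the sketch below rescales a distribution by raising each probability to the power `1 / temperature` and renormalizing, which is equivalent to dividing the logits by the temperature before the softmax. This is one common formulation and is not necessarily how the notebooks implement it. The function names and the example probabilities are hypothetical.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Return probs rescaled as p ** (1 / temperature), renormalized to sum to 1.
    temperature > 1 flattens the distribution, < 1 sharpens it, 1 leaves it unchanged."""
    scaled = [math.exp(math.log(p) / temperature) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_char(chars, probs, temperature=1.0, rng=random):
    """Sample one character from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(chars, weights=weights, k=1)[0]

# Made-up next-character distribution after the sequence "play".
chars = [" ", "e", "s", "x"]
probs = [0.40, 0.38, 0.20, 0.02]

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])

rng = random.Random(0)  # seeded so that repeated runs give the same samples
print([sample_char(chars, probs, temperature=1.0, rng=rng) for _ in range(10)])
```
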
Have a look at how this soft text generation is implemented in the notebooks.