microsoft/AI-For-Beginners

# Language Modeling

Semantic embeddings, such as Word2Vec and GloVe, are in fact a first step towards **language modeling** - creating models that somehow *understand* (or *represent*) the nature of the language.

## [Pre-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/115)

The main idea behind language modeling is to train models on unlabeled datasets in an unsupervised manner. This is important because we have huge amounts of unlabeled text available, while the amount of labeled text will always be limited by the effort we can spend on labeling. Most often, we build language models that **predict missing words** in the text, because it is easy to mask out a random word in text and use it as a training sample.

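To make the masking idea concrete, here is a minimal sketch of how one training sample might be produced from a raw sentence. The `MASK` placeholder and the `make_masked_sample` helper are hypothetical names chosen for illustration; real pipelines differ in how they tokenize text and choose which positions to hide.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def make_masked_sample(tokens, rng):
    """Hide one randomly chosen token; return (masked_tokens, position, target)."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)          # copy, so the original sentence is untouched
    target = masked[position]      # the word the model should learn to predict
    masked[position] = MASK
    return masked, position, target


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
sentence = "i like to train language models".split()
for _ in range(3):
    masked, position, target = make_masked_sample(sentence, rng)
    print(" ".join(masked), "->", target)
```

No labels were written by hand here: the text itself supplies both the input (the masked sentence) and the expected answer (the hidden word).
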
## Training Embeddings

In our previous examples, we used pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained. There are several possible ideas that can be used:

* **N-Gram** language modeling, where we predict a token by looking at the N previous tokens (N-gram).
* **Continuous Bag-of-Words** (CBoW), where we predict the middle token $W_0$ in a token sequence $W_{-N}$, ..., $W_N$.
* **Skip-gram**, where we predict a set of neighboring tokens {$W_{-N},\dots, W_{-1}, W_1,\dots, W_N$} from the middle token $W_0$.

![image from paper on converting words to vectors](../14-Embeddings/images/example-algorithms-for-converting-words-to-vectors.png)

> Image from [this paper](https://arxiv.org/pdf/1301.3781.pdf)

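The three schemes differ only in what is treated as the input and what as the prediction target. The sketch below builds training samples for each from the same token list. The function names are hypothetical, and as a simplification it only uses positions where a full window of N tokens is available, so tokens near the edges of the sequence are skipped.

```python
def ngram_pairs(tokens, n):
    """N-gram: (N previous tokens, next token)."""
    return [(tokens[i - n:i], tokens[i]) for i in range(n, len(tokens))]


def cbow_pairs(tokens, n):
    """CBoW: (N tokens on each side, middle token)."""
    return [
        (tokens[i - n:i] + tokens[i + 1:i + n + 1], tokens[i])
        for i in range(n, len(tokens) - n)
    ]


def skipgram_pairs(tokens, n):
    """Skip-gram: (middle token, neighbor), one pair per neighbor."""
    return [
        (center, neighbor)
        for context, center in cbow_pairs(tokens, n)
        for neighbor in context
    ]


tokens = "the quick brown fox jumps".split()
print(ngram_pairs(tokens, 2))
print(cbow_pairs(tokens, 1))
print(skipgram_pairs(tokens, 1))
```

Note that skip-gram is CBoW with the roles reversed: the same window is used, but the middle token becomes the input and each neighbor becomes a separate target.
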
## ✍️ Example Notebooks: Training CBoW model

Continue your learning in the following notebooks:

* [Training CBoW Word2Vec with TensorFlow](CBoW-TF.ipynb)

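If you want a feel for the mechanics before opening the notebook, the following is a dependency-free sketch of CBoW training. It is not the notebook's code: the `train_cbow` function, its parameters, and the toy corpus are all assumptions made for illustration. It averages the context embeddings, applies a full softmax over the vocabulary, and updates the weights with plain stochastic gradient descent, which is only practical for a tiny vocabulary.

```python
import math
import random


def train_cbow(tokens, window=1, dim=8, epochs=200, lr=0.1, seed=0):
    """Train a tiny CBoW model; return (vocab, input embeddings, loss per epoch)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # Input embeddings and output weights: one row of `dim` numbers per word.
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]

    # (context ids, target id) samples, using full windows only.
    samples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        samples.append(([index[w] for w in context], index[tokens[i]]))

    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in samples:
            # Hidden vector: average of the context word embeddings.
            hidden = [sum(emb[c][d] for c in context) / len(context) for d in range(dim)]
            # Softmax over the whole vocabulary (shifted by the max for stability).
            scores = [sum(out[v][d] * hidden[d] for d in range(dim)) for v in range(size)]
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total -= math.log(probs[target])
            # Cross-entropy gradient w.r.t. the scores: probs - one_hot(target).
            grad_scores = [p - (1.0 if v == target else 0.0) for v, p in enumerate(probs)]
            grad_hidden = [sum(grad_scores[v] * out[v][d] for v in range(size)) for d in range(dim)]
            for v in range(size):
                for d in range(dim):
                    out[v][d] -= lr * grad_scores[v] * hidden[d]
            for c in context:
                for d in range(dim):
                    emb[c][d] -= lr * grad_hidden[d] / len(context)
        losses.append(total / len(samples))
    return vocab, emb, losses


corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab, emb, losses = train_cbow(corpus)
print(f"average loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The rows of `emb` are the learned word vectors. The notebook uses TensorFlow for the same job, which handles the gradients automatically and scales to realistic vocabularies.
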
## Conclusion

In the previous lesson we saw that word embeddings work like magic! Now we know that training word embeddings is not a very complex task, and we should be able to train our own word embeddings for domain-specific text if needed.

## [Post-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/215)

## Review & Self Study

* [Official PyTorch tutorial on Language Modeling](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).
* [Official TensorFlow tutorial on training Word2Vec model](https://www.TensorFlow.org/tutorials/text/word2vec).
* Using the **gensim** framework to train the most commonly used embeddings in a few lines of code is described [in this documentation](https://radimrehurek.com/gensim/models/word2vec.html).

## 🚀 [Assignment: Train Skip-Gram Model](lab/README.md)

In the lab, we challenge you to modify the code from this lesson to train a skip-gram model instead of CBoW. [Read the details](lab/README.md).