microsoft/AI-For-Beginners


mirrored from https://github.com/microsoft/AI-For-Beginners

88034d51145c0d8bc71779cf98037daab83275c8

lessons/5-NLP/15-LanguageModeling/README.md

# Language Modeling

Semantic embeddings, such as Word2Vec and GloVe, are in fact a first step towards language modeling: creating models that in some way understand (or represent) the nature of language.

## [Pre-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/115)

The main idea behind language modeling is to train models on unlabeled datasets in an unsupervised manner. This is important because huge amounts of unlabeled text are available, while the amount of labeled text will always be limited by the effort we can spend on labeling. Most often, we build language models that predict missing words in text, because it is easy to mask out a random word and use it as a training sample.

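To make the masking idea concrete, here is a minimal sketch of how one training sample could be produced from a plain sentence. The `[MASK]` placeholder and the `make_masked_sample` helper are assumptions made for this illustration, not part of the lesson's notebooks.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this illustration


def make_masked_sample(tokens, rng):
    """Hide one randomly chosen token; return (masked_tokens, position, target)."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, position, target


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "we have huge amounts of unlabeled text available".split()
masked, position, target = make_masked_sample(sentence, rng)
print(" ".join(masked), "->", target)
```

No labeling effort is needed here: the hidden word itself serves as the label, so any raw text can be turned into training data this way.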
## Training Embeddings

In our previous examples, we used pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained. There are several approaches that can be used:

* N-Gram language modeling, where we predict a token by looking at the N previous tokens (N-gram).
* Continuous Bag-of-Words (CBoW), where we predict the middle token $W_0$ from the surrounding tokens in a sequence $W_{-N}, \dots, W_N$.
* Skip-gram, where we predict a set of neighboring tokens $\{W_{-N},\dots, W_{-1}, W_1,\dots, W_N\}$ from the middle token $W_0$.

![image from paper on converting words to vectors](../14-Embeddings/images/example-algorithms-for-converting-words-to-vectors.png)

> Image from [this paper](https://arxiv.org/pdf/1301.3781.pdf)

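The three approaches differ mainly in how (input, output) training pairs are cut out of a token sequence. The sketch below shows one possible way to build those pairs. The function names and the toy sentence are illustrative assumptions, and positions without a full window on both sides are simply skipped.

```python
def ngram_pairs(tokens, n):
    """(previous n tokens) -> next token."""
    return [(tuple(tokens[i - n:i]), tokens[i]) for i in range(n, len(tokens))]


def cbow_pairs(tokens, window):
    """(context tokens on both sides) -> middle token."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((tuple(context), tokens[i]))
    return pairs


def skipgram_pairs(tokens, window):
    """middle token -> each neighboring token, one pair per neighbor."""
    pairs = []
    for i in range(window, len(tokens) - window):
        for j in range(i - window, i + window + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = "i like to train word embeddings".split()
print(ngram_pairs(tokens, 2)[0])
print(cbow_pairs(tokens, 2)[0])
print(skipgram_pairs(tokens, 2)[:4])
```

Note that CBoW and skip-gram use the same windows and only swap the roles of input and output: CBoW maps many context tokens to one middle token, while skip-gram maps one middle token to each of its neighbors.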
## ✍️ Example Notebooks: Training CBoW model

Continue your learning in the following notebook:

* [Training CBoW Word2Vec with TensorFlow](CBoW-TF.ipynb)

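For readers who want to see the mechanics before opening the notebook, the following is a framework-free sketch of CBoW training in plain Python. It is not the notebook's code: the toy corpus, the embedding size, the learning rate, and the use of a full softmax with plain SGD are all assumptions chosen to keep the example small.

```python
import math
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
V, D, WINDOW = len(vocab), 8, 2  # vocabulary size, embedding size, context half-width

rng = random.Random(0)
# E holds the input (word) embeddings, U the output weights of the softmax layer.
E = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

# Training samples: (indices of context words, index of the middle word).
samples = []
for i in range(WINDOW, len(corpus) - WINDOW):
    context = corpus[i - WINDOW:i] + corpus[i + 1:i + WINDOW + 1]
    samples.append(([index[w] for w in context], index[corpus[i]]))


def train_epoch(lr):
    """One pass of plain SGD over all samples; returns the mean cross-entropy loss."""
    total = 0.0
    for context, target in samples:
        # Forward pass: average the context embeddings, then softmax over the vocabulary.
        h = [sum(E[c][d] for c in context) / len(context) for d in range(D)]
        scores = [sum(U[j][d] * h[d] for d in range(D)) for j in range(V)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        probs = [e / norm for e in exps]
        total -= math.log(probs[target])
        # Backward pass: the gradient of the loss w.r.t. the scores is probs - one_hot.
        dscores = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
        dh = [sum(dscores[j] * U[j][d] for j in range(V)) for d in range(D)]
        for j in range(V):
            for d in range(D):
                U[j][d] -= lr * dscores[j] * h[d]
        for c in context:
            for d in range(D):
                E[c][d] -= lr * dh[d] / len(context)
    return total / len(samples)


losses = [train_epoch(lr=0.1) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, the rows of `E` are the learned word embeddings. A real implementation would typically use a deep learning framework, a much larger corpus, and a cheaper alternative to the full softmax.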
## Conclusion

In the previous lesson we saw that word embeddings work like magic! Now we know that training word embeddings is not a very complex task, and we should be able to train our own word embeddings for domain-specific text if needed.

## [Post-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/215)

## Review & Self Study

* [Official PyTorch tutorial on Language Modeling](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).
* [Official TensorFlow tutorial on training a Word2Vec model](https://www.TensorFlow.org/tutorials/text/word2vec).
* Training the most commonly used embeddings in a few lines of code with the gensim framework is described [in this documentation](https://radimrehurek.com/gensim/models/word2vec.html).

## 🚀 [Assignment: Train Skip-Gram Model](lab/README.md)

In the lab, we challenge you to modify the code from this lesson to train a skip-gram model instead of CBoW. [Read the details](lab/README.md)
