microsoft/AI-For-Beginners

lessons/5-NLP/13-TextRep/README.md

# Representing Text as Tensors

## Text Classification

Throughout the first part of this course, we will focus on the text classification task. We will use the [AG News](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) dataset, which contains news articles like the following:

* Category: Sci/Tech
* Title: Ky. Company Wins Grant to Study Peptides (AP)
* Body: AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop...

Our goal is to classify each news item into one of the categories based on its text.

## Representing text

If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers, using encodings such as ASCII or UTF-8, and map those numbers to font glyphs when drawing them on your screen.

<img alt="Image showing diagram mapping a character to an ASCII and binary representation" src="images/ascii-character-map.png" width="50%"/>

> [Image source](https://www.seobility.net/en/wiki/ASCII)

We understand what each letter represents, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.

We can therefore use different approaches when representing text:
* Character-level representation, in which we represent text by treating each character as a number. Given that we have C different characters in our text corpus, the word Hello would be represented by a 5xC tensor. Each letter would correspond to one row of that tensor, holding the letter's one-hot encoding.
* Word-level representation, in which we create a vocabulary of all words in our text, and then represent words using one-hot encoding. This approach is somewhat better, because a single letter does not carry much meaning by itself, so by using higher-level semantic concepts (words) we simplify the task for the neural network. However, given the large vocabulary size, we need to deal with high-dimensional sparse tensors.

Regardless of the representation, we first need to convert the text into a sequence of tokens, one token being either a character, a word, or sometimes even part of a word. Then, we convert each token into a number, typically using a vocabulary, and this number can be fed into a neural network using one-hot encoding.
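The pipeline just described (tokenize, look up each token in a vocabulary, one-hot encode the resulting number) can be sketched in plain Python. This is a minimal illustration rather than the course's implementation: the helper names `build_vocab`, `one_hot` and `encode` are hypothetical, nested lists stand in for tensors, and the vocabulary is built from the example text itself instead of a full corpus.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode(tokens, vocab):
    """Turn a token sequence into a len(tokens) x len(vocab) one-hot matrix."""
    return [one_hot(vocab[tok], len(vocab)) for tok in tokens]


# Character-level: every character is a token.
char_tokens = list("hello")
char_vocab = build_vocab(char_tokens)          # 4 distinct characters, so C = 4
char_matrix = encode(char_tokens, char_vocab)  # 5 rows (letters) x 4 columns

# Word-level: a naive whitespace split stands in for a real tokenizer (assumption).
word_tokens = "i like to go fishing".split()
word_vocab = build_vocab(word_tokens)
word_matrix = encode(word_tokens, word_vocab)  # 5 rows (words) x 5 columns
```

With a realistic vocabulary of tens of thousands of words, each row would be almost entirely zeros, which is the sparsity problem mentioned above.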

## N-Grams

In natural language, the precise meaning of a word can only be determined in context. For example, the meanings of *neural network* and *fishing network* are completely different. One way to take this into account is to build our model on pairs of words, treating each word pair as a separate vocabulary token. In this way, the sentence *I like to go fishing* will be represented by the following sequence of tokens: *I like*, *like to*, *to go*, *go fishing*. The problem with this approach is that the vocabulary size grows significantly, and combinations like *go fishing* and *go shopping* are represented by different tokens, which do not share any semantic similarity despite having the same verb.

In some cases, we may consider using tri-grams (combinations of three words) as well. This general approach is therefore often called n-grams. It also makes sense to use n-grams with character-level representation, in which case n-grams will roughly correspond to different syllables.
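As a rough sketch of how such tokens can be produced, the hypothetical helper `ngrams` below slides a window of length n over a token list. Whitespace splitting is an assumption made for brevity, and the `sep` parameter is introduced here only so that the same function works for both words and characters.

```python
def ngrams(tokens, n, sep=" "):
    """Return every run of n consecutive tokens, joined into a single token."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "I like to go fishing".split()
bigrams = ngrams(words, 2)    # ['I like', 'like to', 'to go', 'go fishing']
trigrams = ngrams(words, 3)   # ['I like to', 'like to go', 'to go fishing']

# Character-level n-grams: the tokens are single characters, joined without a separator.
char_trigrams = ngrams(list("fishing"), 3, sep="")  # ['fis', 'ish', 'shi', 'hin', 'ing']
```

Each distinct n-gram then gets its own vocabulary entry, which is why the vocabulary grows so quickly as n increases.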

## Bag-of-Words and TF/IDF

When solving tasks like text classification, we need to be able to represent text by one fixed-size vector, which we will use as an input to the final dense classifier. One of the simplest ways to do that is to combine all individual word representations, e.g. by adding them. If we add the one-hot encodings of each word, we will end up with a vector of frequencies, showing how many times each word appears in the text. Such a representation of text is called bag of words (BOW).

<img src="images/bow.png" width="90%"/>

> Image by author

BOW essentially represents which words appear in the text and in which quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as president and country, while a scientific publication would have words like collider, discovered, etc. Thus, word frequencies can in many cases be a good indicator of text content.
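The idea of adding up one-hot vectors can be sketched as follows. The two-sentence corpus, the lowercase whitespace tokenization, and the helper names `build_vocab` and `bag_of_words` are all assumptions made for illustration; incrementing one position of the count vector is equivalent to adding that word's one-hot vector.

```python
def build_vocab(documents):
    """Assign an index to every distinct word across all documents."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return a fixed-size vector of word counts for `text`."""
    counts = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:                 # words outside the vocabulary are ignored
            counts[vocab[word]] += 1      # same effect as adding the word's one-hot vector
    return counts


corpus = [
    "the president visited the country",
    "physicists discovered a new particle at the collider",
]
vocab = build_vocab(corpus)
bow = bag_of_words(corpus[0], vocab)      # length equals len(vocab), whatever the text length
```

The resulting vector always has one entry per vocabulary word, so texts of any length map to an input of the same size for the classifier. Word order is lost in the process.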

The problem with BOW is that certain common words, such as and, is, etc., appear in most texts and have the highest frequencies, masking out the words that are really important. We may lower the importance of those words by taking into account the frequency at which words occur in the whole document collection. This is the main idea behind the TF/IDF approach, which is covered in more detail in the notebooks below.
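To make the down-weighting concrete, here is a minimal sketch of one common TF/IDF variant, which multiplies a word's count in a document by log(N / df), where N is the number of documents and df is the number of documents containing the word. The exact formula is an assumption: libraries and the notebooks may use smoothed or normalized variants. The corpus and the helper name `tf_idf` are likewise illustrative.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document, using count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word occurs at least once.
    df = {}
    for tokens in tokenized:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1

    weights = []
    for tokens in tokenized:
        doc_weights = {}
        for word in set(tokens):
            tf = tokens.count(word)       # raw term frequency in this document
            doc_weights[word] = tf * math.log(n_docs / df[word])
        weights.append(doc_weights)
    return weights


corpus = [
    "the president and the country",
    "the collider and the discovery",
    "the president visited the collider",
]
weights = tf_idf(corpus)
```

In this toy corpus, "the" occurs in every document, so its weight is zero even though it is the most frequent word, while "country", which occurs in only one document, receives the largest weight in the first text.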

However, none of these approaches can fully take into account the semantics of text. For that we need more powerful neural network models, which we will discuss later in this course.

## Continue to Notebooks

* [Text Representation with PyTorch](TextRepresentationPyTorch.ipynb)
* [Text Representation with TensorFlow](TextRepresentationTF.ipynb)
