microsoft/AI-For-Beginners

Public

mirrored fromhttps://github.com/microsoft/AI-For-BeginnersAvailable

CodeCommitsIssuesPull requestsActionsInsightsSecurity
5d97797124da3144ff63633bbe57da7df800b6bc

Branches

Tags

  • No tags available.
0Branches0Tags
Go to file
Add file
Code

Clone

HTTPS

Download ZIP

lessons/5-NLP/13-TextRep/README.md

52lines · modecode

1# Representing Text as Tensors
2
3## Text Classification
4
5Throughout the first part of this course, we will focus on **text classification** task. We will use [AG News](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) Dataset, which contains news articles like the following:
6
7* Category: Sci/Tech
8* Title: Ky. Company Wins Grant to Study Peptides (AP)
9* Body: AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop...
10
11Our goal would be to classify the news item into one of the categories based on text.
12
13## Representing text
14
15If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.
16
17<img alt="Image showing diagram mapping a character to an ASCII and binary representation" src="images/ascii-character-map.png" width="50%"/>
18
19> [Image source](https://www.seobility.net/en/wiki/ASCII)
20
21We understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and neural network has to learn the meaning during training.
22
23Therefore, we can use different approaches when representing text:
24* **Character-level representation**, when we represent text by treating each character as a number. Given that we have *C* different characters in our text corpus, the word *Hello* would be represented by 5x*C* tensor. Each letter would correspond to a tensor column in one-hot encoding.
25* **Word-level representation**, in which we create a **vocabulary** of all words in our text, and then represent words using one-hot encoding. This approach is somehow better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts - words - we simplify the task for the neural network. However, given large dictionary size, we need to deal with high-dimensional sparse tensors.
26
27Regardless of the representation, we first need to convert text into a sequence of **tokens**, one token being either a character, a word, or sometimes even part of a word. Then, we convert token into a number, typically using **vocabulary**, and this number can be fed into a neural network using one-hot encoding.
28
29## N-Grams
30
31In natural language, precise meaning of words can only be determined in context. For example, meanings of *neural network* and *fishing network* are completely different. One of the ways to take this into account is to build our model on pairs of words, and considering word pairs as separate vocabulary tokens. In this way, the sentence *I like to go fishing* will be represented by the following sequence of tokens: *I like*, *like to*, *to go*, *go fishing*. The problem with this approach is that the dictionary size grows significantly, and combinations like *go fishing* and *go shopping* are presented by different tokens, which do not share any semantic similarity despite the same verb.
32
33In some cases, we may consider using tri-grams -- combinations of three words -- as well. Thus the approach is such is often called **n-grams**. Also, it makes sense to use n-grams with character-level representation, in which case n-grams will roughly correspond to different syllabi.
34
35## Bag-of-Words and TF/IDF
36
37When solving tasks like text classification, we need to be able to represent text by one fixed-size vector, which we will use as an input to final dense classifier. One of the simplest ways to do that is to combine all individual word representations, eg. by adding them. If we add one-hot encodings of each word, we will end up with a vector of frequencies, showing how many times each word appears inside the text. Such representation of text is called **bag of words** (BOW).
38
39<img src="images/bow.png" width="90%"/>
40
41> Image by author
42
43BOW essentially represents which words appear in text and in which quantities, which can indeed be a good indication of what the text is about. For example, news article on politics is likely to contains words such as *president* and *country*, while scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.
44
45The problem with BOW is that certain common words, such as *and*, *is*, etc. appear in most of the texts, and they have highest frequencies, masking out the words that are really important. We may lower the importance of those words by taking into account the frequency at which words occur in the whole document collection. This is the main idea behind TF/IDF approach, which is covered in more detail in the notebooks below.
46
47However, none of those approaches can fully take into account the semantics of text. We need more powerful neural networks models, which we will discuss later in this course.
48
49## Continue to Notebooks
50
51* [Text Representation with PyTorch](TextRepresentationPyTorch.ipynb)
52* [Text Representation with TensorFlow](TextRepresentationTF.ipynb)
53