microsoft/AI-For-Beginners

mirrored from https://github.com/microsoft/AI-For-Beginners

9055907df3fb7071d169ef87a2340566e0f176e6

lessons/5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb


1{
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# Text classification task\n",
8 "\n",
9 "As we have mentioned, we will focus on a simple text classification task based on the **AG_NEWS** dataset: classifying news headlines into one of 4 categories, World, Sports, Business and Sci/Tech.\n",
10 "\n",
11 "## The Dataset\n",
12 "\n",
13 "This dataset is built into the [`torchtext`](https://github.com/pytorch/text) module, so we can easily access it."
14 ]
15 },
16 {
17 "cell_type": "code",
18 "execution_count": 1,
19 "metadata": {},
20 "outputs": [],
21 "source": [
22 "import torch\n",
23 "import torchtext\n",
24 "import os\n",
25 "import collections\n",
26 "os.makedirs('./data',exist_ok=True)\n",
27 "train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')\n",
28 "classes = ['World', 'Sports', 'Business', 'Sci/Tech']"
29 ]
30 },
31 {
32 "cell_type": "markdown",
33 "metadata": {},
34 "source": [
35 "Here, `train_dataset` and `test_dataset` are iterators that yield pairs of label (class number, from 1 to 4) and text, for example:"
36 ]
37 },
38 {
39 "cell_type": "code",
40 "execution_count": 2,
41 "metadata": {},
42 "outputs": [
43 {
44 "data": {
45 "text/plain": [
46 "(3,\n",
47 " \"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\\\band of ultra-cynics, are seeing green again.\")"
48 ]
49 },
50 "execution_count": 2,
51 "metadata": {},
52 "output_type": "execute_result"
53 }
54 ],
55 "source": [
56 "list(train_dataset)[0]"
57 ]
58 },
59 {
60 "cell_type": "markdown",
61 "metadata": {},
62 "source": [
63 "So, let's print out the first 5 news headlines from our dataset: "
64 ]
65 },
66 {
67 "cell_type": "code",
68 "execution_count": 5,
69 "metadata": {},
70 "outputs": [
71 {
72 "name": "stdout",
73 "output_type": "stream",
74 "text": [
75 "**Business** -> Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.\n",
76 "**Business** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.\n",
77 "**Business** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.\n",
78 "**Business** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.\n",
79 "**Business** -> Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.\n"
80 ]
81 }
82 ],
83 "source": [
84 "for i,x in zip(range(5),train_dataset):\n",
85 "    print(f\"**{classes[x[0]-1]}** -> {x[1]}\")  # labels are 1-based, list indices are 0-based\n"
86 ]
87 },
88 {
89 "cell_type": "markdown",
90 "metadata": {},
91 "source": [
92 "Because the datasets are iterators, if we want to use the data multiple times we need to convert them to lists:"
93 ]
94 },
95 {
96 "cell_type": "code",
97 "execution_count": 3,
98 "metadata": {},
99 "outputs": [],
100 "source": [
101 "train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')\n",
102 "train_dataset = list(train_dataset)\n",
103 "test_dataset = list(test_dataset)"
104 ]
105 },
106 {
107 "cell_type": "markdown",
108 "metadata": {},
109 "source": [
110 "## Tokenization\n",
111 "\n",
112 "Now we need to convert text into **numbers** that can be represented as tensors. If we want a word-level representation, we need to do two things:\n",
113 "* use a **tokenizer** to split text into **tokens**\n",
114 "* build a **vocabulary** of those tokens."
115 ]
116 },
117 {
118 "cell_type": "code",
119 "execution_count": 4,
120 "metadata": {},
121 "outputs": [
122 {
123 "data": {
124 "text/plain": [
125 "['he', 'said', 'hello']"
126 ]
127 },
128 "execution_count": 4,
129 "metadata": {},
130 "output_type": "execute_result"
131 }
132 ],
133 "source": [
134 "tokenizer = torchtext.data.utils.get_tokenizer('basic_english')\n",
135 "tokenizer('He said: hello')"
136 ]
137 },
138 {
139 "cell_type": "code",
140 "execution_count": 5,
141 "metadata": {},
142 "outputs": [],
143 "source": [
144 "counter = collections.Counter()\n",
145 "for (label, line) in train_dataset:\n",
146 "    counter.update(tokenizer(line))\n",
147 "vocab = torchtext.vocab.vocab(counter, min_freq=1)"
148 ]
149 },
150 {
151 "cell_type": "markdown",
152 "metadata": {},
153 "source": [
154 "Using the vocabulary, we can easily encode our tokenized string into a sequence of numbers:"
155 ]
156 },
157 {
158 "cell_type": "code",
159 "execution_count": 19,
160 "metadata": {},
161 "outputs": [
162 {
163 "name": "stdout",
164 "output_type": "stream",
165 "text": [
166 "Vocab size is 95810\n"
167 ]
168 },
169 {
170 "data": {
171 "text/plain": [
172 "[599, 3279, 97, 1220, 329, 225, 7368]"
173 ]
174 },
175 "execution_count": 19,
176 "metadata": {},
177 "output_type": "execute_result"
178 }
179 ],
180 "source": [
181 "vocab_size = len(vocab)\n",
182 "print(f\"Vocab size is {vocab_size}\")\n",
183 "\n",
184 "stoi = vocab.get_stoi() # dict to convert tokens to indices\n",
185 "\n",
186 "def encode(x):\n",
187 "    return [stoi[s] for s in tokenizer(x)]\n",
188 "\n",
189 "encode('I love to play with my words')"
190 ]
191 },
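The `encode` function above looks every token up directly in `stoi`, so a token that never occurred in the training data would raise a `KeyError`. One possible way to handle this is to skip unknown tokens. The sketch below is only an illustration: it uses a plain dictionary in place of the torchtext vocabulary, and `simple_tokenize`, `build_stoi` and `encode_safe` are hypothetical helpers that are not part of the notebook.

```python
# Hypothetical stand-ins for the notebook's tokenizer and vocabulary.
def simple_tokenize(text):
    """Lowercase and split on whitespace (much cruder than 'basic_english')."""
    return text.lower().split()

def build_stoi(texts):
    """Assign indices to tokens in order of first appearance."""
    stoi = {}
    for text in texts:
        for token in simple_tokenize(text):
            if token not in stoi:
                stoi[token] = len(stoi)
    return stoi

def encode_safe(text, stoi):
    """Encode known tokens and silently drop out-of-vocabulary ones."""
    return [stoi[t] for t in simple_tokenize(text) if t in stoi]

stoi_demo = build_stoi(["i love to play", "play with my words"])
encoded_demo = encode_safe("I love to play chess", stoi_demo)
print(encoded_demo)  # 'chess' is not in the vocabulary and is skipped
```

Dropping unknown tokens is only one design choice; another common option is to map them all to a dedicated "unknown" index.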
192 {
193 "cell_type": "markdown",
194 "metadata": {},
195 "source": [
196 "## Bag of Words text representation\n",
197 "\n",
198 "Because words represent meaning, sometimes we can figure out the meaning of a text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather*, *snow* are likely to indicate *weather forecast*, while words like *stocks*, *dollar* would count towards *financial news*.\n",
199 "\n",
200 "**Bag of Words** (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, and the corresponding vector element contains the number of occurrences of that word in a given document.\n",
201 "\n",
202 "![Image showing how a bag of words vector representation is represented in memory.](images/bag-of-words-example.png) \n",
203 "\n",
204 "> **Note**: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.\n",
205 "\n",
206 "Below is an example of how to generate a bag-of-words representation using the Scikit-learn Python library:"
207 ]
208 },
209 {
210 "cell_type": "code",
211 "execution_count": 7,
212 "metadata": {},
213 "outputs": [
214 {
215 "data": {
216 "text/plain": [
217 "array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)"
218 ]
219 },
220 "execution_count": 7,
221 "metadata": {},
222 "output_type": "execute_result"
223 }
224 ],
225 "source": [
226 "from sklearn.feature_extraction.text import CountVectorizer\n",
227 "vectorizer = CountVectorizer()\n",
228 "corpus = [\n",
229 " 'I like hot dogs.',\n",
230 " 'The dog ran fast.',\n",
231 " 'Its hot outside.',\n",
232 " ]\n",
233 "vectorizer.fit_transform(corpus)\n",
234 "vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"
235 ]
236 },
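The note above says that BoW can be seen as a sum of one-hot-encoded vectors. A minimal sketch of that equivalence, using plain Python lists and made-up token indices (the helper names are hypothetical and not part of the notebook):

```python
from collections import Counter

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def bow_from_one_hot(indices, size):
    """Sum the one-hot vectors of all token indices element-wise."""
    total = [0] * size
    for idx in indices:
        total = [a + b for a, b in zip(total, one_hot(idx, size))]
    return total

def bow_from_counts(indices, size):
    """Count occurrences directly, as a BoW vector is usually built."""
    counts = Counter(indices)
    return [counts.get(i, 0) for i in range(size)]

token_ids = [3, 0, 3, 1]  # made-up encoded text over a vocabulary of 5 tokens
print(bow_from_one_hot(token_ids, 5))  # [1, 1, 0, 2, 0]
print(bow_from_counts(token_ids, 5))   # same vector
```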
237 {
238 "cell_type": "markdown",
239 "metadata": {},
240 "source": [
241 "To compute a bag-of-words vector from the encoded representation of a text in our AG_NEWS dataset, we can use the following function:"
242 ]
243 },
244 {
245 "cell_type": "code",
246 "execution_count": 20,
247 "metadata": {},
248 "outputs": [
249 {
250 "name": "stdout",
251 "output_type": "stream",
252 "text": [
253 "tensor([2., 1., 2., ..., 0., 0., 0.])\n"
254 ]
255 }
256 ],
257 "source": [
258 "vocab_size = len(vocab)\n",
259 "\n",
260 "def to_bow(text,bow_vocab_size=vocab_size):\n",
261 "    res = torch.zeros(bow_vocab_size,dtype=torch.float32)\n",
262 "    for i in encode(text):\n",
263 "        if i<bow_vocab_size:\n",
264 "            res[i] += 1\n",
265 "    return res\n",
266 "\n",
267 "print(to_bow(train_dataset[0][1]))"
268 ]
269 },
270 {
271 "cell_type": "markdown",
272 "metadata": {},
273 "source": [
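274 "> **Note:** Here we are using the global `vocab_size` variable to specify the default size of the vocabulary. Since the vocabulary is often quite large, we can limit it to the most frequent words. Keep in mind that `to_bow` simply drops every token whose index is `bow_vocab_size` or higher, and `torchtext.vocab.vocab` assigns indices in the order in which tokens were inserted into the counter, so the cut-off keeps the most frequent words only if the vocabulary is built from a frequency-sorted counter. Try lowering the `vocab_size` value and running the code below to see how it affects the accuracy. You should expect some drop in accuracy, but not a dramatic one, in exchange for higher performance."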
275 ]
276 },
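One way to obtain a frequency-ordered index, so that an index cut-off really keeps the most frequent words, is sketched below. It relies only on `collections.Counter`; the helper `build_frequency_stoi` and the tiny token lists are made up for illustration and are not part of the notebook or of torchtext.

```python
from collections import Counter

def build_frequency_stoi(token_lists, max_size=None):
    """Map tokens to indices so that more frequent tokens get smaller indices."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # most_common returns (token, count) pairs sorted by decreasing count;
    # with max_size=None it returns every token
    most_common = counter.most_common(max_size)
    return {token: idx for idx, (token, _) in enumerate(most_common)}

freq_docs = [
    ["oil", "prices", "soar"],
    ["oil", "exports", "halted"],
    ["oil", "prices", "fall"],
]
freq_stoi = build_frequency_stoi(freq_docs, max_size=2)
print(freq_stoi)  # {'oil': 0, 'prices': 1}
```

With such a mapping, a check like `i < bow_vocab_size` discards the rarest tokens first.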
277 {
278 "cell_type": "markdown",
279 "metadata": {},
280 "source": [
281 "## Training BoW classifier\n",
282 "\n",
283 "Now that we have learned how to build a bag-of-words representation of our text, let's train a classifier on top of it. First, we need to convert our dataset for training so that each text is converted to its bag-of-words representation. This can be achieved by passing the `bowify` function as the `collate_fn` parameter to the standard torch `DataLoader`:"
284 ]
285 },
286 {
287 "cell_type": "code",
288 "execution_count": 21,
289 "metadata": {},
290 "outputs": [],
291 "source": [
292 "from torch.utils.data import DataLoader\n",
293 "import numpy as np \n",
294 "\n",
295 "# this collate function gets list of batch_size tuples, and needs to \n",
296 "# return a pair of label-feature tensors for the whole minibatch\n",
297 "def bowify(b):\n",
298 "    return (\n",
299 "        torch.LongTensor([t[0]-1 for t in b]),\n",
300 "        torch.stack([to_bow(t[1]) for t in b])\n",
301 "    )\n",
302 "\n",
303 "train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)\n",
304 "test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)"
305 ]
306 },
307 {
308 "cell_type": "markdown",
309 "metadata": {},
310 "source": [
311 "Now let's define a simple classifier neural network that contains one linear layer. The size of the input vector equals `vocab_size`, and the output size corresponds to the number of classes (4). Because we are solving a classification task, the final activation function is `LogSoftmax()`."
312 ]
313 },
314 {
315 "cell_type": "code",
316 "execution_count": 22,
317 "metadata": {},
318 "outputs": [],
319 "source": [
320 "net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))"
321 ]
322 },
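`LogSoftmax` outputs log-probabilities, which pair naturally with a negative log-likelihood loss such as `NLLLoss` (the default `loss_fn` of the training function in this notebook). The arithmetic can be sketched in plain Python; the logits below are made-up scores, and the helper functions are illustrative rather than PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

demo_logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores for the 4 classes
demo_log_probs = log_softmax(demo_logits)

# Exponentiating the log-probabilities gives a distribution that sums to ~1
print(sum(math.exp(lp) for lp in demo_log_probs))
# The loss is small when the target is the highest-scoring class...
print(nll_loss(demo_log_probs, 0))
# ...and larger when the target received a low score
print(nll_loss(demo_log_probs, 2))
```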
323 {
324 "cell_type": "markdown",
325 "metadata": {},
326 "source": [
327 "Now we will define a standard PyTorch training loop. Because our dataset is quite large, for teaching purposes we will train for only one epoch, and sometimes for even less than an epoch (the `epoch_size` parameter allows us to limit training). We also report the accumulated training accuracy during training; the frequency of reporting is specified by the `report_freq` parameter."
328 ]
329 },
330 {
331 "cell_type": "code",
332 "execution_count": 24,
333 "metadata": {},
334 "outputs": [],
335 "source": [
336 "def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.NLLLoss(),epoch_size=None, report_freq=200):\n",
337 "    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)\n",
338 "    net.train()\n",
339 "    total_loss,acc,count,i = 0,0,0,0\n",
340 "    for labels,features in dataloader:\n",
341 "        optimizer.zero_grad()\n",
342 "        out = net(features)\n",
343 "        loss = loss_fn(out,labels) #cross_entropy(out,labels)\n",
344 "        loss.backward()\n",
345 "        optimizer.step()\n",
346 "        total_loss+=loss.item() # accumulate a Python float, not a tensor\n",
347 "        _,predicted = torch.max(out,1)\n",
348 "        acc+=(predicted==labels).sum().item()\n",
349 "        count+=len(labels)\n",
350 "        i+=1\n",
351 "        if i%report_freq==0:\n",
352 "            print(f\"{count}: acc={acc/count}\")\n",
353 "        if epoch_size and count>epoch_size:\n",
354 "            break\n",
355 "    return total_loss/count, acc/count"
356 ]
357 },
358 {
359 "cell_type": "code",
360 "execution_count": 25,
361 "metadata": {},
362 "outputs": [
363 {
364 "name": "stdout",
365 "output_type": "stream",
366 "text": [
367 "3200: acc=0.8028125\n",
368 "6400: acc=0.8371875\n",
369 "9600: acc=0.8534375\n",
370 "12800: acc=0.85765625\n"
371 ]
372 },
373 {
374 "data": {
375 "text/plain": [
376 "(0.026090790722161722, 0.8620069296375267)"
377 ]
378 },
379 "execution_count": 25,
380 "metadata": {},
381 "output_type": "execute_result"
382 }
383 ],
384 "source": [
385 "train_epoch(net,train_loader,epoch_size=15000)"
386 ]
387 },
388 {
389 "cell_type": "markdown",
390 "metadata": {},
391 "source": [
392 "## BiGrams, TriGrams and N-Grams\n",
393 "\n",
394 "One limitation of the bag-of-words approach is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning than the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' by the same vectors, it can confuse our model.\n",
395 "\n",
396 "To address this, **N-gram representations** are often used in document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In a bigram representation, for example, we add all word pairs to the vocabulary, in addition to the original words.\n",
397 "\n",
398 "Below is an example of how to generate a bigram bag-of-words representation using Scikit-learn:\n"
399 ]
400 },
401 {
402 "cell_type": "code",
403 "execution_count": 26,
404 "metadata": {},
405 "outputs": [
406 {
407 "name": "stdout",
408 "output_type": "stream",
409 "text": [
410 "Vocabulary:\n",
411 " {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}\n"
412 ]
413 },
414 {
415 "data": {
416 "text/plain": [
417 "array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],\n",
418 " dtype=int64)"
419 ]
420 },
421 "execution_count": 26,
422 "metadata": {},
423 "output_type": "execute_result"
424 }
425 ],
426 "source": [
427 "bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1)\n",
428 "corpus = [\n",
429 " 'I like hot dogs.',\n",
430 " 'The dog ran fast.',\n",
431 " 'Its hot outside.',\n",
432 " ]\n",
433 "bigram_vectorizer.fit_transform(corpus)\n",
434 "print(\"Vocabulary:\\n\",bigram_vectorizer.vocabulary_)\n",
435 "bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()\n"
436 ]
437 },
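To make the construction of such a vocabulary concrete, unigrams and bigrams can also be generated by hand. The following is a minimal sketch in plain Python; the helper `word_ngrams` is hypothetical and only mimics the idea, not the implementation, of `CountVectorizer(ngram_range=(1, 2))`.

```python
def word_ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n, joined with spaces."""
    result = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

demo_tokens = ["i", "like", "hot", "dogs"]
print(word_ngrams(demo_tokens))
# ['i', 'like', 'hot', 'dogs', 'i like', 'like hot', 'hot dogs']
```

A sentence of `k` tokens yields `k` unigrams and `k-1` bigrams, which hints at how quickly the vocabulary grows once all pairs across a large corpus are collected.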
438 {
439 "cell_type": "markdown",
440 "metadata": {},
441 "source": [
442 "The main drawback of the N-gram approach is that the vocabulary size starts to grow extremely fast. In practice, we need to combine the N-gram representation with some dimensionality reduction technique, such as *embeddings*, which we will discuss in the next unit.\n",
443 "\n",
444 "To use an N-gram representation in our **AG News** dataset, we need to build a special N-gram vocabulary:"
445 ]
446 },
447 {
448 "cell_type": "code",
449 "execution_count": 27,
450 "metadata": {},
451 "outputs": [
452 {
453 "name": "stdout",
454 "output_type": "stream",
455 "text": [
456 "Bigram vocabulary length = 1308842\n"
457 ]
458 }
459 ],
460 "source": [
461 "counter = collections.Counter()\n",
462 "for (label, line) in train_dataset:\n",
463 "    l = tokenizer(line)\n",
464 "    counter.update(torchtext.data.utils.ngrams_iterator(l,ngrams=2))\n",
465 " \n",
466 "bi_vocab = torchtext.vocab.vocab(counter, min_freq=1)\n",
467 "\n",
468 "print(\"Bigram vocabulary length = \",len(bi_vocab))"
469 ]
470 },
471 {
472 "cell_type": "markdown",
473 "metadata": {},
474 "source": [
475 "We could then use the same code as above to train the classifier; however, it would be very memory-inefficient, because each bag-of-words vector would have about 1.3 million elements. In the next unit, we will train a bigram classifier using embeddings.\n",
476 "\n",
477 "> **Note:** You can keep only those N-grams that occur in the text at least a specified number of times. This ensures that infrequent bigrams are omitted and decreases the dimensionality significantly. To do this, set the `min_freq` parameter to a higher value, and observe how the length of the vocabulary changes."
478 ]
479 },
480 {
481 "cell_type": "markdown",
482 "metadata": {},
483 "source": [
484 "## Term Frequency Inverse Document Frequency (TF-IDF)\n",
485 "\n",
486 "In BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, frequent words such as *a* or *in* are clearly much less important for classification than specialized terms. In fact, in most NLP tasks some words are more relevant than others.\n",
487 "\n",
488 "**TF-IDF** stands for **term frequency–inverse document frequency**. It is a variation of bag of words, where instead of a raw count (or a binary 0/1 value) indicating the appearance of a word in a document, a floating-point weight is used, which is related to the frequency of word occurrence in the corpus.\n",
489 "\n",
490 "More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:\n",
491 "$$\n",
492 "w_{ij} = tf_{ij}\\times\\log({N\\over df_i})\n",
493 "$$\n",
494 "where\n",
495 "* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before\n",
496 "* $N$ is the number of documents in the collection\n",
497 "* $df_i$ is the number of documents containing the word $i$ in the whole collection\n",
498 "\n",
499 "The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in *every* document in the collection, then $df_i=N$ and $w_{ij}=0$, so such terms are completely disregarded.\n",
500 "\n",
501 "You can easily create a TF-IDF vectorization of text using Scikit-learn:"
502 ]
503 },
504 {
505 "cell_type": "code",
506 "execution_count": 28,
507 "metadata": {},
508 "outputs": [
509 {
510 "data": {
511 "text/plain": [
512 "array([[0.43381609, 0. , 0.43381609, 0. , 0.65985664,\n",
513 " 0.43381609, 0. , 0. , 0. , 0. ,\n",
514 " 0. , 0. , 0. , 0. , 0. ,\n",
515 " 0. ]])"
516 ]
517 },
518 "execution_count": 28,
519 "metadata": {},
520 "output_type": "execute_result"
521 }
522 ],
523 "source": [
524 "from sklearn.feature_extraction.text import TfidfVectorizer\n",
525 "vectorizer = TfidfVectorizer(ngram_range=(1,2))\n",
526 "vectorizer.fit_transform(corpus)\n",
527 "vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"
528 ]
529 },
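The numbers above do not follow the plain formula $w_{ij} = tf_{ij}\times\log(N/df_i)$ exactly, because `TfidfVectorizer` by default uses a smoothed variant of the inverse document frequency and L2-normalizes each document vector. The unmodified formula from this section can be sketched in plain Python as follows; the function `plain_tf_idf` and the pre-tokenized toy corpus are assumptions made for illustration only.

```python
import math

def plain_tf_idf(corpus_tokens):
    """Compute w_ij = tf_ij * log(N / df_i) for every document j and word i."""
    n_docs = len(corpus_tokens)
    # df_i: number of documents that contain word i at least once
    df = {}
    for tokens in corpus_tokens:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1
    weights = []
    for tokens in corpus_tokens:
        doc_weights = {}
        for word in tokens:
            doc_weights[word] = doc_weights.get(word, 0) + 1  # tf_ij
        for word in doc_weights:
            doc_weights[word] *= math.log(n_docs / df[word])
        weights.append(doc_weights)
    return weights

tfidf_docs = [
    ["i", "like", "hot", "dogs"],
    ["the", "dog", "ran", "fast"],
    ["its", "hot", "outside"],
]
tfidf_weights = plain_tf_idf(tfidf_docs)
for doc_weights in tfidf_weights:
    print(doc_weights)
```

Here *hot* occurs in two of the three documents and therefore receives a lower weight than words that occur in only one.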
530 {
531 "cell_type": "markdown",
532 "metadata": {},
533 "source": [
534 "## Conclusion \n",
535 "\n",
536 "Even though TF-IDF representations assign frequency-based weights to different words, they are unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.” We will learn later in the course how to capture contextual information from text using language modeling.\n"
537 ]
538 }
539 ],
540 "metadata": {
541 "interpreter": {
542 "hash": "16af2a8bbb083ea23e5e41c7f5787656b2ce26968575d8763f2c4b17f9cd711f"
543 },
544 "kernelspec": {
545 "display_name": "Python 3.8.12 ('py38')",
546 "language": "python",
547 "name": "python3"
548 },
549 "language_info": {
550 "codemirror_mode": {
551 "name": "ipython",
552 "version": 3
553 },
554 "file_extension": ".py",
555 "mimetype": "text/x-python",
556 "name": "python",
557 "nbconvert_exporter": "python",
558 "pygments_lexer": "ipython3",
559 "version": "3.8.12"
560 }
561 },
562 "nbformat": 4,
563 "nbformat_minor": 2
564}