microsoft/AI-For-Beginners

Mirrored from https://github.com/microsoft/AI-For-Beginners

88034d51145c0d8bc71779cf98037daab83275c8

lessons/5-NLP/13-TextRep/TextRepresentationTF.ipynb

1	`{`
2	`"cells": [`
3	`{`
4	`"cell_type": "markdown",`
5	`"metadata": {},`
6	`"source": [`
7	`"# Text classification task\n",`
8	`"\n",`
9	`"In this module, we will start with a simple text classification task based on the [AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) dataset: we'll classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech. \n",`
10	`"\n",`
11	`"## The Dataset\n",`
12	`"\n",`
13	`"To load the dataset, we will use the [TensorFlow Datasets](https://www.tensorflow.org/datasets) API."`
14	`]`
15	`},`
16	`{`
17	`"cell_type": "code",`
18	`"execution_count": 1,`
19	`"metadata": {},`
20	`"outputs": [],`
21	`"source": [`
22	`"import tensorflow as tf\n",`
23	`"from tensorflow import keras\n",`
24	`"import tensorflow_datasets as tfds\n",`
25	`"\n",`
26	`"# In this tutorial, we will be training a lot of models. In order to use GPU memory cautiously,\n",`
27	`"# we will set tensorflow option to grow GPU memory allocation when required.\n",`
28	`"physical_devices = tf.config.list_physical_devices('GPU') \n",`
29	`"if len(physical_devices)>0:\n",`
30	`" tf.config.experimental.set_memory_growth(physical_devices[0], True)\n",`
31	`"\n",`
32	`"dataset = tfds.load('ag_news_subset')"`
33	`]`
34	`},`
35	`{`
36	`"cell_type": "markdown",`
37	`"metadata": {},`
38	`"source": [`
39	"We can now access the training and test portions of the dataset by using `dataset['train']` and `dataset['test']` respectively:"
40	`]`
41	`},`
42	`{`
43	`"cell_type": "code",`
44	`"execution_count": 3,`
45	`"metadata": {},`
46	`"outputs": [`
47	`{`
48	`"name": "stdout",`
49	`"output_type": "stream",`
50	`"text": [`
51	`"Length of train dataset = 120000\n",`
52	`"Length of test dataset = 7600\n"`
53	`]`
54	`}`
55	`],`
56	`"source": [`
57	`"ds_train = dataset['train']\n",`
58	`"ds_test = dataset['test']\n",`
59	`"\n",`
60	`"print(f\"Length of train dataset = {len(ds_train)}\")\n",`
61	`"print(f\"Length of test dataset = {len(ds_test)}\")"`
62	`]`
63	`},`
64	`{`
65	`"cell_type": "markdown",`
66	`"metadata": {},`
67	`"source": [`
68	`"Let's print out the first 5 news headlines from our dataset:"`
69	`]`
70	`},`
71	`{`
72	`"cell_type": "code",`
73	`"execution_count": 4,`
74	`"metadata": {},`
75	`"outputs": [`
76	`{`
77	`"name": "stdout",`
78	`"output_type": "stream",`
79	`"text": [`
80	`"3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'\n",`
81	`"1 (Sports) -> b\"Wood's Suspension Upheld (Reuters)\" b'Reuters - Major League Baseball\\\\Monday announced a decision on the appeal filed by Chicago Cubs\\\\pitcher Kerry Wood regarding a suspension stemming from an\\\\incident earlier this season.'\n",`
82	`"2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'\n",`
83	`"3 (Sci/Tech) -> b\"'Halt science decline in schools'\" b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'\n",`
84	`"1 (Sports) -> b'Gerrard leaves practice' b'London, England (Sports Network) - England midfielder Steven Gerrard injured his groin late in Thursday #39;s training session, but is hopeful he will be ready for Saturday #39;s World Cup qualifier against Austria.'\n"`
85	`]`
86	`}`
87	`],`
88	`"source": [`
89	`"classes = ['World', 'Sports', 'Business', 'Sci/Tech']\n",`
90	`"\n",`
91	`"for i,x in zip(range(5),ds_train):\n",`
92	`" print(f\"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}\")"`
93	`]`
94	`},`
95	`{`
96	`"cell_type": "markdown",`
97	`"metadata": {},`
98	`"source": [`
99	`"## Text vectorization\n",`
100	`"\n",`
101	`"Now we need to convert text into numbers that can be represented as tensors. If we want word-level representation, we need to do two things:\n",`
102	`"\n",`
103	`"* Use a tokenizer to split text into tokens.\n",`
104	`"* Build a vocabulary of those tokens.\n",`
105	`"\n",`
106	"Both of those steps can be handled by the `TextVectorization` layer. We instantiate the vectorizer object and then call its `adapt` method to go through the text and build a vocabulary.\n",
107	`"\n",`
108	`"### Limiting vocabulary size\n",`
109	`"\n",`
110	"In the AG News dataset, the vocabulary is rather large: more than 100k words. Generally speaking, we don't need words that rarely occur in the text; only a few sentences contain them, and the model will not learn from them. Thus, it makes sense to limit the vocabulary to a smaller size by passing the `max_tokens` argument to the vectorizer constructor:\n",
111	`"\n"`
112	`]`
113	`},`
114	`{`
115	`"cell_type": "code",`
116	`"execution_count": 5,`
117	`"metadata": {},`
118	`"outputs": [],`
119	`"source": [`
120	`"vocab_size = 50000\n",`
121	`"vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size)\n",`
122	`"vectorizer.adapt(ds_train.take(500).map(lambda x: x['title']+' '+x['description']))"`
123	`]`
124	`},`
125	`{`
126	`"cell_type": "markdown",`
127	`"metadata": {},`
128	`"source": [`
129	"> Note that we are using only a subset of the whole dataset to build the vocabulary. We do this to speed up execution and not keep you waiting. However, we take the risk that some words from the whole dataset are not included in the vocabulary; those words are mapped to the unknown token `[UNK]` and carry no information during training. Thus, running through the whole dataset during `adapt` should increase the final accuracy, but not significantly.\n",
130	`"\n",`
131	`"Now we can access the actual vocabulary:"`
132	`]`
133	`},`
134	`{`
135	`"cell_type": "code",`
136	`"execution_count": 6,`
137	`"metadata": {},`
138	`"outputs": [`
139	`{`
140	`"name": "stdout",`
141	`"output_type": "stream",`
142	`"text": [`
143	`"['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for']\n",`
144	`"Length of vocabulary: 5335\n"`
145	`]`
146	`}`
147	`],`
148	`"source": [`
149	`"vocab = vectorizer.get_vocabulary()\n",`
150	`"vocab_size = len(vocab)\n",`
151	`"print(vocab[:10])\n",`
152	`"print(f\"Length of vocabulary: {vocab_size}\")"`
153	`]`
154	`},`
155	`{`
156	`"cell_type": "markdown",`
157	`"metadata": {},`
158	`"source": [`
159	`"Using the vectorizer, we can easily encode any text into a sequence of numbers:"`
160	`]`
161	`},`
162	`{`
163	`"cell_type": "code",`
164	`"execution_count": 7,`
165	`"metadata": {},`
166	`"outputs": [`
167	`{`
168	`"data": {`
169	`"text/plain": [`
170	`"<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 112, 3695, 3, 304, 11, 1041, 1], dtype=int64)>"`
171	`]`
172	`},`
173	`"execution_count": 7,`
174	`"metadata": {},`
175	`"output_type": "execute_result"`
176	`}`
177	`],`
178	`"source": [`
179	`"vectorizer('I love to play with my words')"`
180	`]`
181	`},`
182	`{`
183	`"cell_type": "markdown",`
184	`"metadata": {},`
185	`"source": [`
186	`"## Bag-of-words text representation\n",`
187	`"\n",`
188	`"Because words represent meaning, sometimes we can figure out the meaning of a piece of text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like weather and snow are likely to indicate weather forecast, while words like stocks and dollar would count towards financial news.\n",`
189	`"\n",`
190	`"Bag-of-words (BoW) is the simplest traditional vector representation to understand. Each word is linked to a vector index, and each vector element contains the number of occurrences of the corresponding word in a given document.\n",`
191	`"\n",`
192	`"![Image showing how a bag of words vector representation is represented in memory.](images/bag-of-words-example.png) \n",`
193	`"\n",`
194	`"> Note: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.\n",`
195	`"\n",`
196	`"Below is an example of how to generate a bag-of-words representation using the Scikit Learn python library:"`
197	`]`
198	`},`
199	`{`
200	`"cell_type": "code",`
201	`"execution_count": 8,`
202	`"metadata": {},`
203	`"outputs": [`
204	`{`
205	`"data": {`
206	`"text/plain": [`
207	`"array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)"`
208	`]`
209	`},`
210	`"execution_count": 8,`
211	`"metadata": {},`
212	`"output_type": "execute_result"`
213	`}`
214	`],`
215	`"source": [`
216	`"from sklearn.feature_extraction.text import CountVectorizer\n",`
217	`"sc_vectorizer = CountVectorizer()\n",`
218	`"corpus = [\n",`
219	`" 'I like hot dogs.',\n",`
220	`" 'The dog ran fast.',\n",`
221	`" 'Its hot outside.',\n",`
222	`" ]\n",`
223	`"sc_vectorizer.fit_transform(corpus)\n",`
224	`"sc_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"`
225	`]`
226	`},`
227	`{`
228	`"cell_type": "markdown",`
229	`"metadata": {},`
230	`"source": [`
231	`"We can also use the Keras vectorizer that we defined above, converting each word number into a one-hot encoding and adding all those vectors up:"`
232	`]`
233	`},`
234	`{`
235	`"cell_type": "code",`
236	`"execution_count": 9,`
237	`"metadata": {},`
238	`"outputs": [`
239	`{`
240	`"data": {`
241	`"text/plain": [`
242	`"array([0., 5., 0., ..., 0., 0., 0.], dtype=float32)"`
243	`]`
244	`},`
245	`"execution_count": 9,`
246	`"metadata": {},`
247	`"output_type": "execute_result"`
248	`}`
249	`],`
250	`"source": [`
251	`"def to_bow(text):\n",`
252	`" return tf.reduce_sum(tf.one_hot(vectorizer(text),vocab_size),axis=0)\n",`
253	`"\n",`
254	`"to_bow('My dog likes hot dogs on a hot day.').numpy()"`
255	`]`
256	`},`
257	`{`
258	`"cell_type": "markdown",`
259	`"metadata": {},`
260	`"source": [`
261	`"> Note: You may be surprised that the result differs from the previous example. In the Keras example, the length of the vector equals the vocabulary size, and the vocabulary was built from a 500-sample subset of the AG News dataset, while in the Scikit Learn example we built the vocabulary from the sample corpus on the fly. Words missing from the Keras vocabulary are all counted at the index of the [UNK] token.\n"`
262	`]`
263	`},`
264	`{`
265	`"cell_type": "markdown",`
266	`"metadata": {},`
267	`"source": [`
268	`"## Training the BoW classifier\n",`
269	`"\n",`
270	"Now that we have learned how to build the bag-of-words representation of our text, let's train a classifier that uses it. First, we need to convert our dataset to a bag-of-words representation. This can be achieved by using the `map` function in the following way:"
271	`]`
272	`},`
273	`{`
274	`"cell_type": "code",`
275	`"execution_count": 11,`
276	`"metadata": {},`
277	`"outputs": [],`
278	`"source": [`
279	`"batch_size = 128\n",`
280	`"\n",`
281	`"ds_train_bow = ds_train.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)\n",`
282	`"ds_test_bow = ds_test.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)"`
283	`]`
284	`},`
285	`{`
286	`"cell_type": "markdown",`
287	`"metadata": {},`
288	`"source": [`
289	"Now let's define a simple classifier neural network that contains one linear layer. The input size is `vocab_size`, and the output size corresponds to the number of classes (4). Because we're solving a classification task, the final activation function is softmax:"
290	`]`
291	`},`
292	`{`
293	`"cell_type": "code",`
294	`"execution_count": 12,`
295	`"metadata": {},`
296	`"outputs": [`
297	`{`
298	`"name": "stdout",`
299	`"output_type": "stream",`
300	`"text": [`
301	`"938/938 [==============================] - 66s 70ms/step - loss: 0.6144 - acc: 0.8427 - val_loss: 0.4416 - val_acc: 0.8697\n"`
302	`]`
303	`},`
304	`{`
305	`"data": {`
306	`"text/plain": [`
307	`"<keras.callbacks.History at 0x20c70a947f0>"`
308	`]`
309	`},`
310	`"execution_count": 12,`
311	`"metadata": {},`
312	`"output_type": "execute_result"`
313	`}`
314	`],`
315	`"source": [`
316	`"model = keras.models.Sequential([\n",`
317	`" keras.layers.Dense(4,activation='softmax',input_shape=(vocab_size,))\n",`
318	`"])\n",`
319	`"model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",`
320	`"model.fit(ds_train_bow,validation_data=ds_test_bow)"`
321	`]`
322	`},`
323	`{`
324	`"cell_type": "markdown",`
325	`"metadata": {},`
326	`"source": [`
327	`"Since we have 4 classes, random guessing would yield roughly 25% accuracy on balanced data, so an accuracy above 80% is a good result.\n",`
328	`"\n",`
329	`"## Training a classifier as one network\n",`
330	`"\n",`
331	"Because the vectorizer is also a Keras layer, we can define a network that includes it, and train it end-to-end. This way we don't need to vectorize the dataset using `map`; we can just pass the original dataset to the input of the network.\n",
332	`"\n",`
333	"> Note: We would still have to apply maps to our dataset to convert fields from dictionaries (such as `title`, `description` and `label`) to tuples. However, when loading data from disk, we can build a dataset with the required structure in the first place."
334	`]`
335	`},`
336	`{`
337	`"cell_type": "code",`
338	`"execution_count": 13,`
339	`"metadata": {},`
340	`"outputs": [`
341	`{`
342	`"name": "stdout",`
343	`"output_type": "stream",`
344	`"text": [`
345	`"Model: \"model\"\n",`
346	`"_________________________________________________________________\n",`
347	`" Layer (type) Output Shape Param # \n",`
348	`"=================================================================\n",`
349	`" input_1 (InputLayer) [(None, 1)] 0 \n",`
350	`" \n",`
351	`" text_vectorization (TextVec (None, None) 0 \n",`
352	`" torization) \n",`
353	`" \n",`
354	`" tf.one_hot (TFOpLambda) (None, None, 5335) 0 \n",`
355	`" \n",`
356	`" tf.math.reduce_sum (TFOpLam (None, 5335) 0 \n",`
357	`" bda) \n",`
358	`" \n",`
359	`" dense_2 (Dense) (None, 4) 21344 \n",`
360	`" \n",`
361	`"=================================================================\n",`
362	`"Total params: 21,344\n",`
363	`"Trainable params: 21,344\n",`
364	`"Non-trainable params: 0\n",`
365	`"_________________________________________________________________\n",`
366	`"938/938 [==============================] - 73s 77ms/step - loss: 0.6057 - acc: 0.8414 - val_loss: 0.4202 - val_acc: 0.8736\n"`
367	`]`
368	`},`
369	`{`
370	`"data": {`
371	`"text/plain": [`
372	`"<keras.callbacks.History at 0x20c721521f0>"`
373	`]`
374	`},`
375	`"execution_count": 13,`
376	`"metadata": {},`
377	`"output_type": "execute_result"`
378	`}`
379	`],`
380	`"source": [`
381	`"def extract_text(x):\n",`
382	`" return x['title']+' '+x['description']\n",`
383	`"\n",`
384	`"def tupelize(x):\n",`
385	`" return (extract_text(x),x['label'])\n",`
386	`"\n",`
387	`"inp = keras.Input(shape=(1,),dtype=tf.string)\n",`
388	`"x = vectorizer(inp)\n",`
389	`"x = tf.reduce_sum(tf.one_hot(x,vocab_size),axis=1)\n",`
390	`"out = keras.layers.Dense(4,activation='softmax')(x)\n",`
391	`"model = keras.models.Model(inp,out)\n",`
392	`"model.summary()\n",`
393	`"\n",`
394	`"model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",`
395	`"model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))\n"`
396	`]`
397	`},`
398	`{`
399	`"cell_type": "markdown",`
400	`"metadata": {},`
401	`"source": [`
402	`"## Bigrams, trigrams and n-grams\n",`
403	`"\n",`
404	`"One limitation of the bag-of-words approach is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning from the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' using the same vectors, it can confuse our model.\n",`
405	`"\n",`
406	`"To address this, n-gram representations are often used in methods of document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In bigram representations, for example, we will add all word pairs to the vocabulary, in addition to original words.\n",`
407	`"\n",`
408	`"Below is an example of how to generate a bigram bag-of-words representation using Scikit Learn:"`
409	`]`
410	`},`
411	`{`
412	`"cell_type": "code",`
413	`"execution_count": 14,`
414	`"metadata": {},`
415	`"outputs": [`
416	`{`
417	`"name": "stdout",`
418	`"output_type": "stream",`
419	`"text": [`
420	`"Vocabulary:\n",`
421	`" {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}\n"`
422	`]`
423	`},`
424	`{`
425	`"data": {`
426	`"text/plain": [`
427	`"array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],\n",`
428	`" dtype=int64)"`
429	`]`
430	`},`
431	`"execution_count": 14,`
432	`"metadata": {},`
433	`"output_type": "execute_result"`
434	`}`
435	`],`
436	`"source": [`
437	`"bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1)\n",`
438	`"corpus = [\n",`
439	`" 'I like hot dogs.',\n",`
440	`" 'The dog ran fast.',\n",`
441	`" 'Its hot outside.',\n",`
442	`" ]\n",`
443	`"bigram_vectorizer.fit_transform(corpus)\n",`
444	`"print(\"Vocabulary:\\n\",bigram_vectorizer.vocabulary_)\n",`
445	`"bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()\n"`
446	`]`
447	`},`
448	`{`
449	`"cell_type": "markdown",`
450	`"metadata": {},`
451	`"source": [`
452	`"The main drawback of the n-gram approach is that the vocabulary size starts to grow extremely fast. In practice, we need to combine the n-gram representation with a dimensionality reduction technique, such as embeddings, which we will discuss in the next unit.\n",`
453	`"\n",`
454	"To use an n-gram representation with our AG News dataset, we need to pass the `ngrams` parameter to our `TextVectorization` constructor. The bigram vocabulary is significantly larger; in our case it contains more than 1.3 million tokens! Thus it makes sense to limit the number of bigram tokens to some reasonable value as well.\n",
455	`"\n",`
456	`"We could use the same code as above to train the classifier, however, it would be very memory-inefficient. In the next unit, we will train the bigram classifier using embeddings. In the meantime, you can experiment with bigram classifier training in this notebook and see if you can get higher accuracy."`
457	`]`
458	`},`
459	`{`
460	`"cell_type": "markdown",`
461	`"metadata": {},`
462	`"source": [`
463	`"## Automatically calculating BoW Vectors\n",`
464	`"\n",`
465	"In the example above we calculated BoW vectors by hand by summing the one-hot encodings of individual words. However, recent versions of TensorFlow allow us to calculate BoW vectors automatically by passing the `output_mode='count'` parameter to the vectorizer constructor. This makes defining and training our model significantly easier:"
466	`]`
467	`},`
468	`{`
469	`"cell_type": "code",`
470	`"execution_count": 15,`
471	`"metadata": {},`
472	`"outputs": [`
473	`{`
474	`"name": "stdout",`
475	`"output_type": "stream",`
476	`"text": [`
477	`"Training vectorizer\n",`
478	`"938/938 [==============================] - 7s 7ms/step - loss: 0.5929 - acc: 0.8486 - val_loss: 0.4168 - val_acc: 0.8772\n"`
479	`]`
480	`},`
481	`{`
482	`"data": {`
483	`"text/plain": [`
484	`"<keras.callbacks.History at 0x20c725217c0>"`
485	`]`
486	`},`
487	`"execution_count": 15,`
488	`"metadata": {},`
489	`"output_type": "execute_result"`
490	`}`
491	`],`
492	`"source": [`
493	`"model = keras.models.Sequential([\n",`
494	`" keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='count'),\n",`
495	`" keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')\n",`
496	`"])\n",`
497	`"print(\"Training vectorizer\")\n",`
498	`"model.layers[0].adapt(ds_train.take(500).map(extract_text))\n",`
499	`"model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",`
500	`"model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"`
501	`]`
502	`},`
503	`{`
504	`"cell_type": "markdown",`
505	`"metadata": {},`
506	`"source": [`
507	`"## Term frequency - inverse document frequency (TF-IDF)\n",`
508	`"\n",`
509	`"In the BoW representation, word occurrences are weighted in the same way regardless of the word itself. However, frequent words such as *a* and *in* are clearly much less important for classification than specialized terms. In most NLP tasks, some words are more relevant than others.\n",`
510	`"\n",`
511	`"TF-IDF stands for term frequency - inverse document frequency. It's a variation of bag-of-words, where instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of the word occurrence in the corpus.\n",`
512	`"\n",`
513	`"More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:\n",`
514	`"$$\n",`
515	`"w_{ij} = tf_{ij}\\times\\log({N\\over df_i})\n",`
516	`"$$\n",`
517	`"where\n",`
518	`"* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before\n",`
519	`"* $N$ is the number of documents in the collection\n",`
520	`"* $df_i$ is the number of documents containing the word $i$ in the whole collection\n",`
521	`"\n",`
522	`"The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in every document in the collection, then $df_i=N$ and $w_{ij}=0$, so that term is completely disregarded.\n",`
523	`"\n",`
524	`"You can easily create TF-IDF vectorization of text using Scikit Learn:"`
525	`]`
526	`},`
527	`{`
528	`"cell_type": "code",`
529	`"execution_count": 16,`
530	`"metadata": {},`
531	`"outputs": [`
532	`{`
533	`"data": {`
534	`"text/plain": [`
535	`"array([[0.43381609, 0. , 0.43381609, 0. , 0.65985664,\n",`
536	`" 0.43381609, 0. , 0. , 0. , 0. ,\n",`
537	`" 0. , 0. , 0. , 0. , 0. ,\n",`
538	`" 0. ]])"`
539	`]`
540	`},`
541	`"execution_count": 16,`
542	`"metadata": {},`
543	`"output_type": "execute_result"`
544	`}`
545	`],`
546	`"source": [`
547	`"from sklearn.feature_extraction.text import TfidfVectorizer\n",`
548	`"tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))\n",`
549	`"tfidf_vectorizer.fit_transform(corpus)\n",`
550	`"tfidf_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"`
551	`]`
552	`},`
553	`{`
554	`"cell_type": "markdown",`
555	`"metadata": {},`
556	`"source": [`
557	"In Keras, the `TextVectorization` layer can automatically compute TF-IDF frequencies by passing the `output_mode='tf-idf'` parameter. Let's repeat the code we used above to see if using TF-IDF increases accuracy: "
558	`]`
559	`},`
560	`{`
561	`"cell_type": "code",`
562	`"execution_count": 17,`
563	`"metadata": {},`
564	`"outputs": [`
565	`{`
566	`"name": "stdout",`
567	`"output_type": "stream",`
568	`"text": [`
569	`"Training vectorizer\n",`
570	`"938/938 [==============================] - 12s 12ms/step - loss: 0.4197 - acc: 0.8662 - val_loss: 0.3432 - val_acc: 0.8849\n"`
571	`]`
572	`},`
573	`{`
574	`"data": {`
575	`"text/plain": [`
576	`"<keras.callbacks.History at 0x20c729dfd30>"`
577	`]`
578	`},`
579	`"execution_count": 17,`
580	`"metadata": {},`
581	`"output_type": "execute_result"`
582	`}`
583	`],`
584	`"source": [`
585	`"model = keras.models.Sequential([\n",`
586	`" keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='tf-idf'),\n",`
587	`" keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')\n",`
588	`"])\n",`
589	`"print(\"Training vectorizer\")\n",`
590	`"model.layers[0].adapt(ds_train.take(500).map(extract_text))\n",`
591	`"model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",`
592	`"model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"`
593	`]`
594	`},`
595	`{`
596	`"cell_type": "markdown",`
597	`"metadata": {},`
598	`"source": [`
599	`"## Conclusion \n",`
600	`"\n",`
601	`"Even though TF-IDF representations provide frequency weights to different words, they are unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, \"The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.\" We will learn how to capture contextual information from text using language modeling later in the course."`
602	`]`
603	`}`
604	`],`
605	`"metadata": {`
606	`"interpreter": {`
607	`"hash": "0cb620c6d4b9f7a635928804c26cf22403d89d98d79684e4529119355ee6d5a5"`
608	`},`
609	`"kernel_info": {`
610	`"name": "conda-env-py37_tensorflow-py"`
611	`},`
612	`"kernelspec": {`
613	`"display_name": "py37_tensorflow",`
614	`"language": "python",`
615	`"name": "python3"`
616	`},`
617	`"language_info": {`
618	`"codemirror_mode": {`
619	`"name": "ipython",`
620	`"version": 3`
621	`},`
622	`"file_extension": ".py",`
623	`"mimetype": "text/x-python",`
624	`"name": "python",`
625	`"nbconvert_exporter": "python",`
626	`"pygments_lexer": "ipython3",`
627	`"version": "3.8.12"`
628	`},`
629	`"nteract": {`
630	`"version": "nteract-front-end@1.0.0"`
631	`}`
632	`},`
633	`"nbformat": 4,`
634	`"nbformat_minor": 4`
635	`}`
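
The bag-of-words and TF-IDF definitions given in the notebook above can be checked by hand on its three-sentence toy corpus. The following is a minimal sketch in plain Python, not part of the original notebook. The helper names `tokenize`, `bow`, and `tfidf` are hypothetical, and the tokenizer is assumed to keep only lowercase tokens of two or more word characters, which mirrors the default token pattern of Scikit Learn's `CountVectorizer`. The weights follow the notebook's formula $w_{ij} = tf_{ij}\times\log(N/df_i)$ literally, so they will not match the `TfidfVectorizer` output shown in the notebook, because Scikit Learn by default smooths the IDF term and L2-normalizes each row.

```python
import math
import re
from collections import Counter

# Toy corpus taken from the notebook's Scikit Learn examples.
corpus = [
    "I like hot dogs.",
    "The dog ran fast.",
    "Its hot outside.",
]

def tokenize(text):
    # Assumption: lowercase, keep tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]

# Vocabulary: every distinct token, sorted so that each term has a fixed index.
vocab = sorted({t for doc in docs for t in doc})

def bow(tokens):
    # Count occurrences of each vocabulary term; unknown tokens are dropped.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency: number of corpus documents containing each term.
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))

def tfidf(tokens):
    # w_ij = tf_ij * log(N / df_i), exactly as in the notebook's formula.
    return [tf * math.log(N / df[term]) for term, tf in zip(vocab, bow(tokens))]

query = "My dog likes hot dogs on a hot day."
print(vocab)
# ['dog', 'dogs', 'fast', 'hot', 'its', 'like', 'outside', 'ran', 'the']
print(bow(tokenize(query)))
# [1, 1, 0, 2, 0, 0, 0, 0, 0]
print([round(w, 3) for w in tfidf(tokenize(query))])
# [1.099, 1.099, 0.0, 0.811, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The count vector agrees with the `CountVectorizer` output recorded in the notebook. In the TF-IDF vector, the word "hot" occurs twice in the query but appears in two of the three corpus documents, so its weight (about 0.811) ends up lower than that of "dog" (about 1.099), which occurs once in the query and in only one document.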