microsoft/AI-For-Beginners

Public

mirrored fromhttps://github.com/microsoft/AI-For-BeginnersAvailable

CodeCommitsIssuesPull requestsActionsInsightsSecurity
3ddc3a901c7b0ed1644fd70bd7d948b5e8085750

Branches

Tags

  • No tags available.
0Branches0Tags
Go to file
Add file
Code

Clone

HTTPS

Download ZIP

5-NLP/18-Transformers/TransformersTF.ipynb

816lines · modepreview

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Attention mechanisms and transformers\n",
    "\n",
    "One major drawback of recurrent networks is that all words in a sequence have the same impact on the result. This causes sub-optimal performance with standard LSTM encoder-decoder models for sequence to sequence tasks, such as Named Entity Recognition and Machine Translation. In reality specific words in the input sequence often have more impact on sequential outputs than others.\n",
    "\n",
    "Consider sequence-to-sequence model, such as machine translation. It is implemented by two recurrent networks, where one network (**encoder**) would collapse input sequence into hidden state, and another one, **decoder**, would unroll this hidden state into translated result. The problem with this approach is that final state of the network would have hard time remembering the beginning of a sentence, thus causing poor quality of the model on long sentences.\n",
    "\n",
    "**Attention Mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. The way it is implemented is by creating shortcuts between intermediate states of the input RNN, and output RNN. In this manner, when generating output symbol $y_t$, we will take into account all input hidden states $h_i$, with different weight coefficients $\\alpha_{t,i}$. \n",
    "\n",
    "![Image showing an encoder/decoder model with an additive attention layer](images/encoder-decoder-attention.png)\n",
    "*The encoder-decoder model with additive attention mechanism in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), cited from [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*\n",
    "\n",
    "Attention matrix $\\{\\alpha_{i,j}\\}$ would represent the degree which certain input words play in generation of a given word in the output sequence. Below is the example of such a matrix:\n",
    "\n",
    "![Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arviz.org](images/bahdanau-fig3.png)\n",
    "\n",
    "*Figure taken from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig.3)*\n",
    "\n",
    "Attention mechanisms are responsible for much of the current or near current state of the art in Natural language processing. Adding attention however greatly increases the number of model parameters which led to scaling issues with RNNs. A key constraint of scaling RNNs is that the recurrent nature of the models makes it challenging to batch and parallelize training. In an RNN each element of a sequence needs to be processed in sequential order which means it cannot be easily parallelized.\n",
    "\n",
    "Adoption of attention mechanisms combined with this constraint led to the creation of the now State of the Art Transformer Models that we know and use today from BERT to OpenGPT3.\n",
    "\n",
    "## Transformer models\n",
    "\n",
    "Instead of forwarding the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and **attention** to capture the context of a given input with in a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.\n",
    "\n",
    "![Animated GIF showing how the evaluations are performed in transformer models.](images/transformer-animated-explanation.gif) \n",
    "\n",
    "Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can be used to learn different relationships between words that improves downstream Natural Language Processing tasks.\n",
    "\n",
    "## Building Simple Transformer Model\n",
    "\n",
    "Keras does not contain built-in Transformer layer, but we can build our own. As before, we will focus on text classification of AG News dataset, but it is worth mentioning that Transformer models show best result at more difficult NLP tasks. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "import tensorflow_datasets as tfds\n",
    "import numpy as np\n",
    "\n",
    "ds_train, ds_test = tfds.load('ag_news_subset').values()\n",
    "\n",
    "def extract_text(x):\n",
    "    return x['title']+' '+x['description']\n",
    "\n",
    "def tupelize(x):\n",
    "    return (extract_text(x),x['label'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "New layers in Keras should subclass `Layer` class, and implement `call` method. Let's start with **Positional Embedding** layer. We will use [some code from official Keras documentation](https://keras.io/examples/nlp/text_classification_with_transformer/). We will assume that we pad all input sequences to length `maxlen`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TokenAndPositionEmbedding(keras.layers.Layer):\n",
    "    def __init__(self, maxlen, vocab_size, embed_dim):\n",
    "        super(TokenAndPositionEmbedding, self).__init__()\n",
    "        self.token_emb = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)\n",
    "        self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)\n",
    "        self.maxlen = maxlen\n",
    "\n",
    "    def call(self, x):\n",
    "        maxlen = self.maxlen\n",
    "        positions = tf.range(start=0, limit=maxlen, delta=1)\n",
    "        positions = self.pos_emb(positions)\n",
    "        x = self.token_emb(x)\n",
    "        return x+positions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This layer consists of two `Embedding` layers: for embedding tokens (in a way we have discussed before) and token positions. Token positions are created as a sequence of natural numbers from 0 to `maxlen` using `tf.range`, and then passed through embedding layer. Two resulting embedding vectors are then added, producing positionally-embedded reporesentation of input of shape `maxlen`$\\times$`embed_dim`.\n",
    "\n",
    "<img src=\"images/pos-embedding.png\" width=\"40%\"/>\n",
    "\n",
    "Now, let's implement the transformer block. It will take the output of previously defined embedding layer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TransformerBlock(keras.layers.Layer):\n",
    "    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):\n",
    "        super(TransformerBlock, self).__init__()\n",
    "        self.att = keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, name='attn')\n",
    "        self.ffn = keras.Sequential(\n",
    "            [keras.layers.Dense(ff_dim, activation=\"relu\"), keras.layers.Dense(embed_dim),]\n",
    "        )\n",
    "        self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)\n",
    "        self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)\n",
    "        self.dropout1 = keras.layers.Dropout(rate)\n",
    "        self.dropout2 = keras.layers.Dropout(rate)\n",
    "\n",
    "    def call(self, inputs, training):\n",
    "        attn_output = self.att(inputs, inputs)\n",
    "        attn_output = self.dropout1(attn_output, training=training)\n",
    "        out1 = self.layernorm1(inputs + attn_output)\n",
    "        ffn_output = self.ffn(out1)\n",
    "        ffn_output = self.dropout2(ffn_output, training=training)\n",
    "        return self.layernorm2(out1 + ffn_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transformer applies `MultiHeadAttention` to the positionally-encoded input to produce the attention vector of the dimension `maxlen`$\\times$`embed_dim`, which is them mixed with input and normalized using `LayerNormalizaton`.\n",
    "\n",
    "> **Note**: `LayerNormalization` is similar to `BatchNormalization` discussed in the *Computer Vision* part of this learning path, but it normalizes outputs of the previous layer for each training sample independently, to bring them to the range [-1..1].\n",
    "\n",
    "Output of this layer is then passed through `Dense` network (in our case - two-layer perceptron), and the result is added to the final output (which undergoes normalization again).\n",
    "\n",
    "<img src=\"images/transformer-layer.png\" width=\"30%\" />\n",
    "\n",
    "Now, we are ready to define complete transformer model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"sequential_1\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "text_vectorization (TextVect (None, 256)               0         \n",
      "_________________________________________________________________\n",
      "token_and_position_embedding (None, 256, 32)           648192    \n",
      "_________________________________________________________________\n",
      "transformer_block (Transform (None, 256, 32)           10656     \n",
      "_________________________________________________________________\n",
      "global_average_pooling1d (Gl (None, 32)                0         \n",
      "_________________________________________________________________\n",
      "dropout_2 (Dropout)          (None, 32)                0         \n",
      "_________________________________________________________________\n",
      "dense_2 (Dense)              (None, 20)                660       \n",
      "_________________________________________________________________\n",
      "dropout_3 (Dropout)          (None, 20)                0         \n",
      "_________________________________________________________________\n",
      "dense_3 (Dense)              (None, 4)                 84        \n",
      "=================================================================\n",
      "Total params: 659,592\n",
      "Trainable params: 659,592\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "embed_dim = 32  # Embedding size for each token\n",
    "num_heads = 2  # Number of attention heads\n",
    "ff_dim = 32  # Hidden layer size in feed forward network inside transformer\n",
    "maxlen = 256\n",
    "vocab_size = 20000\n",
    "\n",
    "model = keras.models.Sequential([\n",
    "    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_sequence_length=maxlen, input_shape=(1,)),\n",
    "    TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim),\n",
    "    TransformerBlock(embed_dim, num_heads, ff_dim),\n",
    "    keras.layers.GlobalAveragePooling1D(),\n",
    "    keras.layers.Dropout(0.1),\n",
    "    keras.layers.Dense(20, activation=\"relu\"),\n",
    "    keras.layers.Dropout(0.1),\n",
    "    keras.layers.Dense(4, activation=\"softmax\")\n",
    "])\n",
    "\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training tokenizer\n",
      "938/938 [==============================] - 45s 39ms/step - loss: 0.4978 - acc: 0.8068 - val_loss: 0.2808 - val_acc: 0.9124\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7f9c2427a0d0>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Training tokenizer')\n",
    "model.layers[0].adapt(ds_train.map(extract_text))\n",
    "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')\n",
    "model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BERT Transformer Models\n",
    "\n",
    "**BERT** (Bidirectional Encoder Representations from Transformers) is a very large multi layer transformer network with 12 layers for *BERT-base*, and 24 for *BERT-large*. The model is first pre-trained on large corpus of text data (WikiPedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training the model absorbs significant level of language understanding which can then be leveraged with other datasets using fine tuning. This process is called **transfer learning**. \n",
    "\n",
    "![picture from http://jalammar.github.io/illustrated-bert/](./images/jalammarBERT-language-modeling-masked-lm.png)\n",
    "\n",
    "There are many variations of Transformer architectures including BERT, DistilBERT. BigBird, OpenGPT3 and more that can be fine tuned. \n",
    "\n",
    "Let's see how we can use pre-trained BERT model for solving our traditional sequence classification problem. We will borrow the idea and some code from [official documentation](https://www.tensorflow.org/text/tutorials/classify_text_with_bert).\n",
    "\n",
    "To load pre-trained models, we will use **Tensorflow hub**. First, let's load the BERT-specific vectorizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'tensorflow_text'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[1;32m~\\AppData\\Local\\Temp/ipykernel_41180/4216669875.py\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mimport\u001b[0m \u001b[0mtensorflow_text\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      2\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mtensorflow_hub\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mhub\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m \u001b[0mvectorizer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mhub\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mKerasLayer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'tensorflow_text'"
     ]
    }
   ],
   "source": [
    "import tensorflow_text \n",
    "import tensorflow_hub as hub\n",
    "vectorizer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=\n",
       " array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],\n",
       "       dtype=int32)>,\n",
       " 'input_word_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=\n",
       " array([[  101,  1045,  2293, 19081,   102,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
       "             0,     0]], dtype=int32)>,\n",
       " 'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=\n",
       " array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],\n",
       "       dtype=int32)>}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer(['I love transformers'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important that you use the same vectorizer as the one that the original network was trained on. Also, BERT vectorizer returns three components:\n",
    "* `input_word_ids`, which is a sequence of token numbers for input sentence\n",
    "* `input_mask`, showing which part of the sequence contains actual input, and which one is padding. It is similar to the mask produced by `Masking` layer\n",
    "* `input_type_ids` is used for language modeling tasks, and allows to specify two input sentences in one sequence.\n",
    "\n",
    "Then, we can instantiate BERT feature extractor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "bert = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pooled_output -> (1, 128)\n",
      "encoder_outputs -> 4\n",
      "sequence_output -> (1, 128, 128)\n",
      "default -> (1, 128)\n"
     ]
    }
   ],
   "source": [
    "z = bert(vectorizer(['I love transformers']))\n",
    "for i,x in z.items():\n",
    "    print(f\"{i} -> { len(x) if isinstance(x, list) else x.shape }\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, BERT layer returns a number of useful results:\n",
    "* `pooled_output` is a result of averaging out all tokens in the sequence. You can view it as an intelligent semantic embedding of the whole network. It is equivalent to the output of `GlobalAveragePooling1D` layer in our previous model.\n",
    "* `sequence_output` is the output of the last transformer layer (corresponds to the output of `TransformerBlock` in our model above)\n",
    "* `encoder_outputs` are the outputs of all transformer layers. Since we have loaded 4-layer BERT model (as you can probably guess from the name, which contains `4_H`), it has 4 tensors. The last one is the same as `sequence_output`.\n",
    "\n",
    "Now we will define the end-to-end classification model. We will use *functional model definition*, when we define model input, and then provide a series of expressions to calculate its output. We will also make BERT model weights not-trainable, and train just the final classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"model\"\n",
      "__________________________________________________________________________________________________\n",
      "Layer (type)                    Output Shape         Param #     Connected to                     \n",
      "==================================================================================================\n",
      "input_1 (InputLayer)            [(None,)]            0                                            \n",
      "__________________________________________________________________________________________________\n",
      "keras_layer (KerasLayer)        {'input_type_ids': ( 0           input_1[0][0]                    \n",
      "__________________________________________________________________________________________________\n",
      "keras_layer_1 (KerasLayer)      {'pooled_output': (N 4782465     keras_layer[0][0]                \n",
      "                                                                 keras_layer[0][1]                \n",
      "                                                                 keras_layer[0][2]                \n",
      "__________________________________________________________________________________________________\n",
      "dropout_4 (Dropout)             (None, 128)          0           keras_layer_1[0][5]              \n",
      "__________________________________________________________________________________________________\n",
      "dense_4 (Dense)                 (None, 4)            516         dropout_4[0][0]                  \n",
      "==================================================================================================\n",
      "Total params: 4,782,981\n",
      "Trainable params: 516\n",
      "Non-trainable params: 4,782,465\n",
      "__________________________________________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "inp = keras.Input(shape=(),dtype=tf.string)\n",
    "x = vectorizer(inp)\n",
    "x = bert(x)\n",
    "x = keras.layers.Dropout(0.1)(x['pooled_output'])\n",
    "out = keras.layers.Dense(4,activation='softmax')(x)\n",
    "model = keras.models.Model(inp,out)\n",
    "bert.trainable = False\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "938/938 [==============================] - 528s 559ms/step - loss: 0.8056 - acc: 0.6983 - val_loss: 0.5953 - val_acc: 0.7888\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7f9bb1e36d00>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')\n",
    "model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Despite the fact that there are few trainable parameters, the process is pretty slow, because BERT feature extractor is computationally heavy. It looks like we were unable to achieve reasonable accuracy, either due to lack of training, or lack of model parameters.\n",
    "\n",
    "Let's try to unfreeze BERT weights and train it as well. This requires very small learning rate, and also more careful training strategy with **warmup**, using **AdamW** optimizer. We will use `tf-models-official` package to create the optimizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"model\"\n",
      "__________________________________________________________________________________________________\n",
      "Layer (type)                    Output Shape         Param #     Connected to                     \n",
      "==================================================================================================\n",
      "input_1 (InputLayer)            [(None,)]            0                                            \n",
      "__________________________________________________________________________________________________\n",
      "keras_layer (KerasLayer)        {'input_type_ids': ( 0           input_1[0][0]                    \n",
      "__________________________________________________________________________________________________\n",
      "keras_layer_1 (KerasLayer)      {'pooled_output': (N 4782465     keras_layer[0][0]                \n",
      "                                                                 keras_layer[0][1]                \n",
      "                                                                 keras_layer[0][2]                \n",
      "__________________________________________________________________________________________________\n",
      "dropout_4 (Dropout)             (None, 128)          0           keras_layer_1[0][5]              \n",
      "__________________________________________________________________________________________________\n",
      "dense_4 (Dense)                 (None, 4)            516         dropout_4[0][0]                  \n",
      "==================================================================================================\n",
      "Total params: 4,782,981\n",
      "Trainable params: 4,782,980\n",
      "Non-trainable params: 1\n",
      "__________________________________________________________________________________________________\n",
      "938/938 [==============================] - 629s 664ms/step - loss: 0.6344 - acc: 0.7658 - val_loss: 0.4876 - val_acc: 0.8247\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7f9bb0bd0070>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from official.nlp import optimization \n",
    "bert.trainable=True\n",
    "model.summary()\n",
    "epochs = 3\n",
    "opt = optimization.create_optimizer(\n",
    "    init_lr=3e-5,\n",
    "    num_train_steps=epochs*len(ds_train),\n",
    "    num_warmup_steps=0.1*epochs*len(ds_train),\n",
    "    optimizer_type='adamw')\n",
    "\n",
    "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer=opt)\n",
    "model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the training goes quite slowly - but you may want to experiment and train the model for a few epochs (5-10) and see if you can get the best result comparing to the approaches we have used before.\n",
    "\n",
    "## Huggingface Transformers Library\n",
    "\n",
    "Another very common (and a bit simpler) way to use Transformer models is [HuggingFace package](https://github.com/huggingface/), which provides simple building blocks for different NLP tasks. It is available both for Tensorflow and PyTorch, another very popular neural network framework. \n",
    "\n",
    "> **Note**: If you are not interested in seeing how Transformers library works - you may skip to the end of this notebook, because you will not see anything substantially different from what we have done above. We will be repeating the same steps of training BERT model using different library and substantially larger model. Thus, the process involves some rather long training, so you may want just to look through the code.\n",
    "\n",
    "Let's see how our problem can be solved using [Huggingface Transformers](http://huggingface.co)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First thing we need to do is to chose the model that we will be using. In addition to some built-in models, Huggingface contains an [online model repository](https://huggingface.co/models), where you can find a lot more pre-trained models by the community. All of those models can be loaded and used just by providing a model name. All required binary files for the model would automatically be downloaded.\n",
    "\n",
    "At certain times you would need to load your own models, in which case you can specify the directory that contains all relevant files, including parameters for tokenizer, `config.json` file with model parameters, binary weights, etc.\n",
    "\n",
    "From model name, we can instantiate both the model and the tokenizer. Let's start with a tokenizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers\n",
    "\n",
    "# To load the model from Internet repository using model name. \n",
    "# Use this if you are running from your own copy of the notebooks\n",
    "bert_model = 'bert-base-uncased' \n",
    "\n",
    "# To load the model from the directory on disk. Use this for Microsoft Learn module, because we have\n",
    "# prepared all required files for you.\n",
    "#bert_model = './bert'\n",
    "\n",
    "tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)\n",
    "\n",
    "MAX_SEQ_LEN = 128\n",
    "PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)\n",
    "UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `tokenizer` object contains the `encode` function that can be directly used to encode text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[101, 23435, 12314, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.encode('Tensorflow is a great framework for NLP')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use tokenizer to encode a sequence is a way suitable for passing to the model, i.e. including `token_ids`, `input_mask` fields, etc. We can also specify that we want Tensorflow tensors by providing `return_tensors='tf'` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[ 101, 7592, 1010, 2045,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[1, 1, 1, 1, 1]], dtype=int32)>}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer(['Hello, there'],return_tensors='tf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case, we will be using pre-trained BERT model called `bert-base-uncased`. *Uncased* indicates that the model in case-insensitive. \n",
    "\n",
    "When training the model, we need to provide tokenized sequence as input, and thus we will design data processing pipeline. Since `tokenizer.encode` is a Python function, we will use the same approach as in the last unit with calling it using `py_function`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process(x):\n",
    "    return tokenizer.encode(x.numpy().decode('utf-8'),return_tensors='tf',padding='max_length',max_length=MAX_SEQ_LEN,truncation=True)[0]\n",
    "\n",
    "def process_fn(x):\n",
    "    s = x['title']+' '+x['description']\n",
    "    e = tf.py_function(process,inp=[s],Tout=(tf.int32))\n",
    "    e.set_shape(MAX_SEQ_LEN)\n",
    "    return e,x['label']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can load the actual model using `BertForSequenceClassfication` package. This ensures that our model already has a required architecture for classification, including final classifier. You will see warning message stating that weights of the final classifier are not initialized, and model would require pre-training - that is perfectly okay, because it is exactly what we are about to do!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = transformers.TFBertForSequenceClassification.from_pretrained(bert_model,num_labels=4,output_attentions=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"tf_bert_for_sequence_classification_1\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "bert (TFBertMainLayer)       multiple                  109482240 \n",
      "_________________________________________________________________\n",
      "dropout_75 (Dropout)         multiple                  0         \n",
      "_________________________________________________________________\n",
      "classifier (Dense)           multiple                  3076      \n",
      "=================================================================\n",
      "Total params: 109,485,316\n",
      "Trainable params: 109,485,316\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see from `summary()`, the model contains almost 110 million parameters! Presumably, if we want simple classification task on relatively small dataset, we do not want to train the BERT base layer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"tf_bert_for_sequence_classification_1\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "bert (TFBertMainLayer)       multiple                  109482240 \n",
      "_________________________________________________________________\n",
      "dropout_75 (Dropout)         multiple                  0         \n",
      "_________________________________________________________________\n",
      "classifier (Dense)           multiple                  3076      \n",
      "=================================================================\n",
      "Total params: 109,485,316\n",
      "Trainable params: 3,076\n",
      "Non-trainable params: 109,482,240\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "model.layers[0].trainable = False\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to begin training!\n",
    "\n",
    "> **Note**: Training full-scale BERT model can be very time consuming! Thus we will only train it for the first 32 batches. This is just to show how model training is set up. If you are interested to try full-scale training - just remove `steps_per_epoch` and `validation_steps` parameters, and prepare to wait!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32/32 [==============================] - 142s 4s/step - loss: 1.3896 - acc: 0.2500 - val_loss: 1.3863 - val_acc: 0.2480\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7f1d40a4b6a0>"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.compile('adam','sparse_categorical_crossentropy',['acc'])\n",
    "tf.get_logger().setLevel('ERROR')\n",
    "model.fit(ds_train.map(process_fn).batch(32),validation_data=ds_test.map(process_fn).batch(32),steps_per_epoch=32,validation_steps=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you increase the number of iterations and wait long enough, and train for several epochs, you can expect that BERT classification gives us the best accuracy! That is because BERT already understands quite well the structure of the language, and we only need to fine-tune final classifier. However, because BERT is a large model, the whole training process takes a long time, and requires serious computational power! (GPU, and preferably more than one).\n",
    "\n",
    "> **Note:** In our example, we have been using one of the smallest pre-trained BERT models. There are larger models that are likely to yield better results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Takeaway\n",
    "\n",
    "In this unit, we have seen very recent model architectures based on **transformers**. We have applied them for our text classification task, but similarly, BERT models can be used for entity extraction, question answering, and other NLP tasks.\n",
    "\n",
    "Transformer models represent current state-of-the-art in NLP, and in most of the cases it should be the first solution you start experimenting with when implementing custom NLP solutions. However, understanding basic underlying principles of recurrent neural networks discussed in this module is extremely important if you want to build advanced neural models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "0cb620c6d4b9f7a635928804c26cf22403d89d98d79684e4529119355ee6d5a5"
  },
  "kernelspec": {
   "display_name": "py38_tensorflow",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}