diff --git a/19_Natural_Language_Processing.ipynb b/19_Natural_Language_Processing.ipynb new file mode 100644 index 0000000..6ce3e9d --- /dev/null +++ b/19_Natural_Language_Processing.ipynb @@ -0,0 +1,400 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5be301c4-61af-4f40-be5f-57f2ba4b7733", + "metadata": {}, + "source": [ + "# Natural Language Processing Intro\n", + "\n", + "Natural Language Processing (NLP) is a large field of AI with many related, but distinct, sub-disciplines. We won't have time to look at all of these, but you are most likely somewhat familiar with many of the applications.\n", + "\n", + "Some NLP tasks:\n", + "* Summary generation, information extraction\n", + "* Translation (language to language)\n", + "* Transcription (speech to text)\n", + "* Auto-completion\n", + "* Sentiment analysis\n", + "* Intent detection\n", + "* Chat, automated writing, dialog generation, question answering\n", + "* Voice assistants\n", + "* Document retrieval\n", + "\n", + "## Ambiguity in language\n", + "\n", + "NLP is not a simple task, and until deep learning, progress was quite limited. Part of the challenge is that human language tends to be ambiguous, and recognizing words is really only the start of inferring meaning. Take for example this sentence:\n", + " > The boy saw a man with a telescope\n", + " \n", + " * Who had the telescope?\n", + " \n", + "More context is needed to answer this. Yet context is not always there, and even when it is, it can be a challenge for NLP methods to use.\n", + "\n", + "## Tokenization\n", + "\n", + "One of the primary challenges of NLP is representing language as numbers--remember, computers and the ML/AI systems we have primarily deal with numbers. For computer vision problems, this was relatively easy in that we took the pixel intensities of an image and fed those in. But what do we do with speech, words, and text?\n", + "\n", + "The process of converting text to a numerical representation is called **tokenization**. There are many methods of tokenization, but the idea is to break text into itemizable components.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cda4b88-5387-4814-9748-24f17dabc0e7", + "metadata": {}, + "outputs": [], + "source": [ + "# Some of the examples here are adapted from https://github.com/NVDLI/LDL\n", + "# which supports the book Learning Deep Learning (LDL) by Magnus Ekman (ISBN: 9780137470358).\n", + "\n", + "\"\"\"\n", + "The MIT License (MIT)\n", + "Copyright (c) 2021 NVIDIA\n", + "Permission is hereby granted, free of charge, to any person obtaining a copy of\n", + "this software and associated documentation files (the \"Software\"), to deal in\n", + "the Software without restriction, including without limitation the rights to\n", + "use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of\n", + "the Software, and to permit persons to whom the Software is furnished to do so,\n", + "subject to the following conditions:\n", + "The above copyright notice and this permission notice shall be included in all\n", + "copies or substantial portions of the Software.\n", + "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", + "IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS\n", + "FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR\n", + "COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER\n", + "IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN\n", + "CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "784964dc-9e92-4314-add7-4341a8e79f45", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from tensorflow.keras.models import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import LSTM\n", + "from tensorflow.keras.layers import Embedding\n", + "from tensorflow.keras.preprocessing.text import Tokenizer\n", + "from tensorflow.keras.preprocessing.text \\\n", + " import text_to_word_sequence\n", + "import tensorflow as tf\n", + "import logging\n", + "tf.get_logger().setLevel(logging.ERROR)\n", + "\n", + "EPOCHS = 10 # originally 32, reduced for time.\n", + "BATCH_SIZE = 256\n", + "INPUT_FILE_NAME = 'data/frankenstein.txt'\n", + "WINDOW_LENGTH = 40\n", + "WINDOW_STEP = 3\n", + "PREDICT_LENGTH = 3\n", + "MAX_WORDS = 10000\n", + "EMBEDDING_WIDTH = 100" + ] + }, + { + "cell_type": "markdown", + "id": "657c20dc-65b3-4133-93ba-94dfed96b89e", + "metadata": {}, + "source": [ + "## Cleaning, tokenization and creation of training set\n", + "\n", + "In the following block, we read in the Frankenstein text file and use the `text_to_word_sequence` function to convert the text to a list of individual words. This also removes punctuation and converts everything to lower case, so this one command accomplishes our cleaning and tokenization steps.\n", + "\n", + "The next step is to create the training set of `fragments` and corresponding `targets`. The hyperparameters set above are used here to make a training set by sliding a frame over the text, `WINDOW_LENGTH=40` words at a time. \n", + "\n", + "The word that follows the frame becomes the target, and the frame then shifts `WINDOW_STEP=3` words to make the next fragment/target pair." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ace86526-a34a-4790-acd9-7ba7e266b4db", + "metadata": {}, + "outputs": [], + "source": [ + "# Open and read file.\n", + "file = open(INPUT_FILE_NAME, 'r', encoding='utf-8-sig')\n", + "text = file.read()\n", + "file.close()\n", + "\n", + "# Make lower case and split into individual words.\n", + "text = text_to_word_sequence(text)\n", + "\n", + "# Create training examples.\n", + "fragments = []\n", + "targets = []\n", + "for i in range(0, len(text) - WINDOW_LENGTH, WINDOW_STEP):\n", + " fragments.append(text[i: i + WINDOW_LENGTH])\n", + " targets.append(text[i + WINDOW_LENGTH])" + ] + }, + { + "cell_type": "markdown", + "id": "2a22d67e-d404-47b5-947f-838416c978ea", + "metadata": {}, + "source": [ + "### Look at one fragment/target pair\n", + "\n", + "The following text from the book happens to be *fragment*/**target** pair 94:\n", + " > *I am already far north of London, and as I walk in the streets of\n", + "Petersburgh, I feel a cold northern breeze play upon my cheeks, which\n", + "braces my nerves and fills me with delight. Do you understand* **this**\n", + "feeling?",
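+ "\n",
+ "\n",
+ "As a minimal illustration (this toy example is not part of the original notebook), the sliding-window construction described above looks like this on a short sentence, using a window of 4 words and a step of 2:\n",
+ "\n",
+ "```python\n",
+ "# Toy version of the fragment/target construction (window=4, step=2).\n",
+ "toy = 'the boy saw a man with a telescope in the park'.split()\n",
+ "for i in range(0, len(toy) - 4, 2):\n",
+ "    print(toy[i:i + 4], '->', toy[i + 4])\n",
+ "# First pair printed: ['the', 'boy', 'saw', 'a'] -> man\n",
+ "```\n"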
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea38b43c-8b9e-481a-bae8-1ea8d7dd1b97", + "metadata": {}, + "outputs": [], + "source": [ + "print(f'Fragment: {fragments[94]}\\nTarget: {targets[94]}')\n", + "\n", + "print(f'\\nTotal fragments: {len(fragments)}')" + ] + }, + { + "cell_type": "markdown", + "id": "7ef736ba-456f-4094-9229-b09eb77dba47", + "metadata": {}, + "source": [ + "So, now we have the whole text split into 26,238 fragments, each with the following word as the target. This is what our model will be trained on.\n", + "\n", + "But we have another step before we can begin training.\n", + "\n", + "## Embedding\n", + "\n", + "Somehow we need to convert the words (both in the fragments and the targets) to numbers. An embedding goes a step further, representing each word as a vector in an n-dimensional space. This space has a much lower dimensionality than the number of words in the vocabulary (as opposed to one-hot encoding), and words with similar meanings end up closer to each other in the space than unrelated words.\n", + "\n", + "Here's an example figure showing some embeddings (taken from Renu Khandelwal's article [*Word Embeddings for NLP*](https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4)):\n", + "\n", + "\n", + "![Image of word embeddings from https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4 ](images/word_embeddings_Renu_Khandelwal.png)\n", + "\n", + "The embedding itself is learned during training. In this step, we simply assign an index to each word. Above, we set `MAX_WORDS=10000`, so only the 10,000 most frequent words are kept in the vocabulary; all other words are mapped to the `UNK` (out-of-vocabulary) token." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17c65bda-8e86-4918-9b83-a8d5852640df", + "metadata": {}, + "outputs": [], + "source": [ + "# Convert to indices.\n", + "tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token='UNK')\n", + "tokenizer.fit_on_texts(text)\n", + "fragments_indexed = tokenizer.texts_to_sequences(fragments)\n", + "targets_indexed = tokenizer.texts_to_sequences(targets)\n", + "\n", + "# Convert to appropriate input and output formats.\n", + "X = np.array(fragments_indexed, dtype=np.int64)\n", + "y = np.zeros((len(targets_indexed), MAX_WORDS))\n", + "for i, target_index in enumerate(targets_indexed):\n", + " y[i, target_index] = 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b32991b-10bc-491f-8fa9-8b799b271511", + "metadata": {}, + "outputs": [], + "source": [ + "print(f'Fragment as numbers: {X[94]} \\nTarget is one-hot encoded: {y[94]}')" + ] + }, + { + "cell_type": "markdown", + "id": "4d64624f-86fc-4d31-928b-b4949fdde18a", + "metadata": {}, + "source": [ + "## Build and fit the model\n", + "\n", + "Our first layer is the embedding layer, which learns to convert the word indices in the vocabulary to embedding vectors. Then we have two LSTM layers, a dense layer with a ReLU activation, and a final dense layer with one neuron per word in the vocabulary and a softmax activation, so that the output is a probability distribution over the vocabulary for the next word.",
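+ "\n",
+ "\n",
+ "As a rough sketch of what the embedding layer does (this toy example is not part of the original notebook), think of it as a trainable lookup table from word indices to dense vectors:\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical toy embedding: a 10-word vocabulary mapped to 4-dimensional vectors.\n",
+ "toy_embedding = Embedding(input_dim=10, output_dim=4)\n",
+ "vectors = toy_embedding(tf.constant([[3, 7]]))  # look up word indices 3 and 7\n",
+ "print(vectors.shape)  # (1, 2, 4): one sequence, two words, one 4-d vector per word\n",
+ "```\n",
+ "\n",
+ "The vectors start out random and are adjusted during training so that words used in similar contexts end up with similar vectors.\n"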
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55cf3e22-815f-4c91-8c1b-5f1c20d91e3c", + "metadata": {}, + "outputs": [], + "source": [ + "# Build and train model.\n", + "training_model = Sequential()\n", + "training_model.add(Embedding(\n", + " output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,\n", + " mask_zero=True, input_length=None))\n", + "training_model.add(LSTM(128, return_sequences=True,\n", + " dropout=0.2, recurrent_dropout=0.2))\n", + "training_model.add(LSTM(128, dropout=0.2,\n", + " recurrent_dropout=0.2))\n", + "training_model.add(Dense(128, activation='relu'))\n", + "training_model.add(Dense(MAX_WORDS, activation='softmax'))\n", + "training_model.compile(loss='categorical_crossentropy',\n", + " optimizer='adam')\n", + "training_model.summary()\n", + "history = training_model.fit(X, y, validation_split=0.05,\n", + " batch_size=BATCH_SIZE, \n", + " epochs=EPOCHS, verbose=2, \n", + " shuffle=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc4967c2-0654-4341-aee8-2922dd98fb7e", + "metadata": {}, + "outputs": [], + "source": [ + "# Build stateful model used for prediction.\n", + "inference_model = Sequential()\n", + "inference_model.add(Embedding(\n", + " output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,\n", + " mask_zero=True, batch_input_shape=(1, 1)))\n", + "inference_model.add(LSTM(128, return_sequences=True,\n", + " dropout=0.2, recurrent_dropout=0.2,\n", + " stateful=True))\n", + "inference_model.add(LSTM(128, dropout=0.2,\n", + " recurrent_dropout=0.2, stateful=True))\n", + "inference_model.add(Dense(128, activation='relu'))\n", + "inference_model.add(Dense(MAX_WORDS, activation='softmax'))\n", + "weights = training_model.get_weights()\n", + "inference_model.set_weights(weights)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7146c8ad-33a5-481a-9189-df5351555a59", + "metadata": {}, + "outputs": [], + "source": [ + "# Provide beginning of sentence and\n", + "# predict next words in a greedy manner\n", + "first_words = ['i', 'saw']\n", + "first_words_indexed = tokenizer.texts_to_sequences(\n", + " first_words)\n", + "inference_model.reset_states()\n", + "predicted_string = ''\n", + "\n", + "# Feed initial words to the model.\n", + "for i, word_index in enumerate(first_words_indexed):\n", + " x = np.zeros((1, 1), dtype=np.int64)\n", + " x[0][0] = word_index[0]\n", + " predicted_string += first_words[i]\n", + " predicted_string += ' '\n", + " y_predict = inference_model.predict(x, verbose=0)[0]\n", + "\n", + "# Predict PREDICT_LENGTH words.\n", + "for i in range(PREDICT_LENGTH):\n", + " new_word_index = np.argmax(y_predict)\n", + " word = tokenizer.sequences_to_texts(\n", + " [[new_word_index]])\n", + " x[0][0] = new_word_index\n", + " predicted_string += word[0]\n", + " predicted_string += ' '\n", + " y_predict = inference_model.predict(x, verbose=0)[0]\n", + "print(predicted_string)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1149ed9-0b09-4a93-be31-63a39b9d0ef3", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's predict more by changing predict length to 10\n", + "\n", + "PREDICT_LENGTH = 10\n", + "\n", + "# Predict PREDICT_LENGTH words.\n", + "for i in range(PREDICT_LENGTH):\n", + " new_word_index = np.argmax(y_predict)\n", + " word = tokenizer.sequences_to_texts(\n", + " [[new_word_index]])\n", + " x[0][0] = new_word_index\n", + " predicted_string += word[0]\n", + " predicted_string += ' '\n", + " y_predict = inference_model.predict(x, verbose=0)[0]\n", + 
"print(predicted_string)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f78ebc-5fd0-46b7-bed8-e873f1897fbb", + "metadata": {}, + "outputs": [], + "source": [ + "# Explore embedding similarities.\n", + "embeddings = training_model.layers[0].get_weights()[0]\n", + "lookup_words = ['the', 'saw', 'see', 'of', 'and',\n", + " 'monster', 'frankenstein', 'read', 'eat']\n", + "for lookup_word in lookup_words:\n", + " lookup_word_indexed = tokenizer.texts_to_sequences(\n", + " [lookup_word])\n", + " print('words close to:', lookup_word)\n", + " lookup_embedding = embeddings[lookup_word_indexed[0]]\n", + " word_indices = {}\n", + " # Calculate distances.\n", + " for i, embedding in enumerate(embeddings):\n", + " distance = np.linalg.norm(\n", + " embedding - lookup_embedding)\n", + " word_indices[distance] = i\n", + " # Print sorted by distance.\n", + " for distance in sorted(word_indices.keys())[:5]:\n", + " word_index = word_indices[distance]\n", + " word = tokenizer.sequences_to_texts([[word_index]])[0]\n", + " print(word + ': ', distance)\n", + " print('')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c426982f-9e36-4bac-abf3-f59fc4b162a5", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c51e4e1d-7196-4efa-b432-969741655a7b", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Tensorflow-2.6.0", + "language": "python", + "name": "tensorflow-2.6.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/20_NLP_Transformers.ipynb b/20_NLP_Transformers.ipynb new file mode 100644 index 0000000..0b962ad --- /dev/null +++ b/20_NLP_Transformers.ipynb @@ -0,0 +1,867 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bd9e9e85-e6ad-4e3e-a696-e329aeb8f60c", + "metadata": {}, + "source": [ + "# Transformers\n", + "\n", + "
\n", + " Note this notebook should be run using the nlp-1.2 kernel on HiPerGator.\n", + "
 \n", + " \n", + "This notebook largely follows, and quotes from, the [Hugging Face Transformer Course](https://huggingface.co/course/chapter1/3?fw=tf). It starts with some motivating examples.\n", + "\n", + "Here are some examples of what Transformers can do in NLP:\n", + "\n", + "## Some Examples\n", + "\n", + "### Sentiment Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "75b6f9b6-f59b-4736-8862-f60fc6e3ae75", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'label': 'POSITIVE', 'score': 0.9598047137260437}]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from transformers import pipeline\n", + "\n", + "classifier = pipeline(\"sentiment-analysis\")\n", + "classifier(\"I've been waiting for a HuggingFace course my whole life.\")" + ] + }, + { + "cell_type": "markdown", + "id": "4cd60204-e7e6-4f82-9c6b-4cca96367ffc", + "metadata": {}, + "source": [ + "> By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the `classifier` object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.\n", + ">\n", + "> There are three main steps involved when you pass some text to a pipeline:\n", + ">\n", + "> 1. The text is preprocessed into a format the model can understand.\n", + "> 1. The preprocessed inputs are passed to the model.\n", + "> 1. The predictions of the model are post-processed, so you can make sense of them.\n", + ">\n", + "> Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) are:\n", + "> \n", + "> * `feature-extraction` (get the vector representation of a text)\n", + "> * `fill-mask`\n", + "> * `ner` (named entity recognition)\n", + "> * `question-answering`\n", + "> * `sentiment-analysis`\n", + "> * `summarization`\n", + "> * `text-generation`\n", + "> * `translation`\n", + "> * `zero-shot-classification`\n", + ">\n", + "> Let’s have a look at a few of these!\n", + "\n", + "### Zero-shot classification\n", + "\n", + "> We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.",
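+ "\n",
+ "\n",
+ "As a brief aside before the zero-shot example: the three pipeline steps listed above (preprocess, model, post-process) can also be carried out by hand. The sketch below is not part of the original notebook and assumes the TensorFlow weights for the default sentiment checkpoint are available:\n",
+ "\n",
+ "```python\n",
+ "from transformers import AutoTokenizer, TFAutoModelForSequenceClassification\n",
+ "import tensorflow as tf\n",
+ "\n",
+ "checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'\n",
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
+ "model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
+ "\n",
+ "# 1. Preprocess: text -> token ids.\n",
+ "inputs = tokenizer('This course is great', return_tensors='tf')\n",
+ "# 2. Model: token ids -> raw scores (logits).\n",
+ "logits = model(inputs).logits\n",
+ "# 3. Post-process: logits -> a probability per label.\n",
+ "print(model.config.id2label, tf.nn.softmax(logits, axis=-1).numpy())\n",
+ "```\n",
+ "\n",
+ "The zero-shot pipeline below wraps the same three steps around a model trained on natural language inference, which is what lets it score arbitrary candidate labels.\n"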
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d583a893-0300-49b8-89b7-eba4d68f7855", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "163351fb838a4509b6c56f3e369f85ee", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading: 0%| | 0.00/1.13k [00:00 This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!\n", + "\n", + "### Text Generation\n", + "\n", + "> Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.\n", + "\n", + "**Note** The default `gpt2` model does not seem to work on HiPerGator at this time." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "4de613df-9ca1-437b-b70b-cda8fc2cfde4", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'generated_text': 'In this course, we will teach you how to learn more skills of communication in writing. Topics include: Communication and writing writing\\n\\n\\n\\n\\n'},\n", + " {'generated_text': 'In this course, we will teach you how to build an efficient web server based on your users base and your user base. 
The first part will show'}]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generator = pipeline(\"text-generation\", model=\"distilgpt2\")\n", + "generator(\n", + " \"In this course, we will teach you how to\",\n", + " max_length=30,\n", + " num_return_sequences=2,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "01b9123d-4bf4-449b-aed3-dfdba7e34571", + "metadata": {}, + "source": [ + "### Mask Filling" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "2fa906cf-d404-4f53-b92b-cf563773b7ed", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "8b249b63276947ca90f845c2ce57d83e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading: 0%| | 0.00/480 [00:00 models.\", top_k=2)" + ] + }, + { + "cell_type": "markdown", + "id": "6f5c4d09-1813-4a7a-ad07-d3ffd75a40f1", + "metadata": {}, + "source": [ + "### Named Entity Recognition (NER)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e90aba8e-c9da-4f13-976c-742b9975f8e8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)\n", + "/apps/nlp/1.2/lib/python3.8/site-packages/transformers/pipelines/token_classification.py:128: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy=\"AggregationStrategy.SIMPLE\"` instead.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'entity_group': 'PER',\n", + " 'score': 0.99930125,\n", + " 'word': 'Matt',\n", + " 'start': 11,\n", + " 'end': 15},\n", + " {'entity_group': 'ORG',\n", + " 'score': 0.997988,\n", + " 'word': 'University of Florida',\n", + " 'start': 34,\n", + " 'end': 55},\n", + " {'entity_group': 'LOC',\n", + " 'score': 0.9874652,\n", + " 'word': 'Gainesville',\n", + " 'start': 59,\n", + " 'end': 70}]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ner = pipeline(\"ner\", grouped_entities=True)\n", + "ner(\"My name is Matt and I work at the University of Florida in Gainesville.\")" + ] + }, + { + "cell_type": "markdown", + "id": "c309d54b-4b15-40d6-9503-1fb83fe5089d", + "metadata": {}, + "source": [ + "### Question answering" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "6f75bba8-7733-41f2-b597-98370ae633ba", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)\n" + ] + }, + { + "data": { + "text/plain": [ + "{'score': 0.4767897427082062,\n", + " 'start': 34,\n", + " 'end': 55,\n", + " 'answer': 'University of Florida'}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "question_answerer = pipeline(\"question-answering\")\n", + "question_answerer(\n", + " question=\"Where do I work?\",\n", + " context=\"My name is Matt and I work at the University of Florida in Gainesville.\",\n", + ")" + ] + }, + { + "cell_type": 
"markdown", + "id": "29eff250-b146-4167-a623-8ae5b89d990b", + "metadata": {}, + "source": [ + "### Summarization" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "fad99d04-0927-4f66-b816-768c8dd10ec7", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "summarizer = pipeline(\"summarization\")\n", + "summarizer(\n", + " \"\"\"\n", + " America has changed dramatically during recent years. Not only has the number of \n", + " graduates in traditional engineering disciplines such as mechanical, civil, \n", + " electrical, chemical, and aeronautical engineering declined, but in most of \n", + " the premier American universities engineering curricula now concentrate on \n", + " and encourage largely the study of engineering science. As a result, there \n", + " are declining offerings in engineering subjects dealing with infrastructure, \n", + " the environment, and related issues, and greater concentration on high \n", + " technology subjects, largely supporting increasingly complex scientific \n", + " developments. While the latter is important, it should not be at the expense \n", + " of more traditional engineering.\n", + "\n", + " Rapidly developing economies such as China and India, as well as other \n", + " industrial countries in Europe and Asia, continue to encourage and advance \n", + " the teaching of engineering. Both China and India, respectively, graduate \n", + " six and eight times as many traditional engineers as does the United States. \n", + " Other industrial countries at minimum maintain their output, while America \n", + " suffers an increasingly serious decline in the number of engineering graduates \n", + " and a lack of well-educated engineers.\n", + "\"\"\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c40c6c50-f366-4cd2-8432-c559bbee6b54", + "metadata": {}, + "source": [ + "### Translation" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "2f197859-2880-477c-8572-60df2459abd5", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "3817f88ee111468a8ec7320334fbd5aa", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading: 0%| | 0.00/1.26k [00:00 All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. **Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!**\n", + "> \n", + "> This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. 
Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.\n", + "\n", + "### Transformers are **BIG** models\n", + "\n", + "The graph below is from the article [*We Might See A 100T Language Model In 2022*](https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/), published in Dec 2021:\n", + "![Image of transformer model size over time, from https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/](images/NVIDIA_NLP_Model_Size.png)\n", + "\n", + "And they are costly in terms of compute and CO2 emissions...\n", + "\n", + "![Relative CO2 emissions for a variety of human activities, from Hugging Face](images/carbon_footprint_HuggingFace.png)\n", + "\n", + "Luckily, as we've seen, transfer learning can help!!\n", + "\n", + "![Schematic of pre-training from Hugging Face](images/pretraining_HuggingFace.png)\n", + "\n", + "> This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.\n", + ">\n", + "> *Fine-tuning*, on the other hand, is the training done **after** a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task." + ] + }, + { + "cell_type": "markdown", + "id": "56410ead-8669-4c8e-b543-155f20025030", + "metadata": {}, + "source": [ + "## Transformer architecture\n", + "\n", + "> The model is primarily composed of two blocks:\n", + "> * **Encoder (left)**: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.\n", + "> * **Decoder (right)**: The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.\n", + "\n", + "![Basic transformer architecture, from Hugging Face](images/transformers_blocks_HuggingFace.png)\n", + "\n", + "> Each of these parts can be used independently, depending on the task:\n", + "> * **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.\n", + "> * **Decoder-only models**: Good for generative tasks such as text generation.\n", + "> * **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization." + ] + }, + { + "cell_type": "markdown", + "id": "5fbc7ea7-b278-4d1d-8ca1-7d0e82a9f1b0", + "metadata": {}, + "source": [ + "## Attention layers\n", + "\n", + "> A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was [\"Attention Is All You Need\"](https://arxiv.org/abs/1706.03762)! ...for now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.\n", + ">\n", + "> To put this into context, consider the task of translating text from English to French. 
Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.\n", + ">\n", + "> The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.\n", + "> \n", + "> Now that you have an idea of what attention layers are all about, let’s take a closer look at the Transformer architecture.\n", + "\n", + "## The original architecture\n", + "\n", + "> The Transformer architecture was originally designed for translation. During training, the **encoder** receives inputs (sentences) in a certain language, while the **decoder** receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.\n", + ">\n", + "> To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.\n", + "> \n", + "> The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:\n", + "\n", + "![Architecture of the Transformer model](images/transformers.png)\n", + "\n", + "> Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. 
This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.\n", + "> \n", + "> The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9deec72-873e-427d-81de-4d448c3020ca", + "metadata": {}, + "source": [ + "## Biases and Limitations of NLP Models\n", + "\n", + "> If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.\n", + ">\n", + "> To give a quick illustration, let’s go back to the example of a fill-mask pipeline with the BERT model:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "54412caa-fbfb-4d00-a370-12eae4a80180", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "8dab4351382c42b7832da07fe4df51ab", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading: 0%| | 0.00/570 [00:00