diff --git a/hw_sentiment.ipynb b/hw_sentiment.ipynb
index a29fe0c..d6c852c 100644
--- a/hw_sentiment.ipynb
+++ b/hw_sentiment.ipynb
@@ -23,16 +23,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)\n",
+    "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)\n",
     "\n",
-    "If colab is opened with this badge, please **save a copy to drive** (from the 'File' menu) before running the notebook."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)"
+    "If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook."
    ]
  },
  {
@@ -53,7 +47,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The homework questions ask you to implement some baseline system using DynaSent Round 1, DynaSent Round 2, and the Stanford Sentiment Treebank. The bakeoff challenge is to define a system that does well on the DynaSent test sets, the SST-3 test set, and a set of mystery examples that don't correspond to the DynaSent or SST-3 domains."
+    "The homework questions ask you to implement some baseline systems using DynaSent Round 1, DynaSent Round 2, and the Stanford Sentiment Treebank. The bakeoff challenge is to define a system that does well on the DynaSent test sets, the SST-3 test set, and a set of mystery examples that don't correspond to the DynaSent or SST-3 domains."
    ]
  },
  {
@@ -104,7 +98,7 @@
  },
  {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "metadata": {
     "id": "pyAzJmyYSNMP"
    },
@@ -302,7 +296,7 @@
    "outputs": [],
    "source": [
     "def print_label_dist(dataset, labelname='gold_label', splitnames=('train', 'validation')):\n",
-    "    for splitname in splitnames: \n",
+    "    for splitname in splitnames:\n",
     "        print(splitname)\n",
     "        dist = sorted(Counter(dataset[splitname][labelname]).items())\n",
     "        for k, v in dist:\n",
@@ -339,7 +333,7 @@
    "id": "p4WFt0C6J8hP"
   },
   "source": [
-    "DynaSent Round 2 was created using different methods than Round 1. For Round 2, crowdworkers edited sentences from the Yelp Academic Dataset seeking to achieve a particular sentiment goal (e.g., expressing a positive sentiment) while fooling a top-performing model. This work was done on the [Dynabench](https://dynabench.org) platform. The hope is that this directly adversarial objective will lead to examples that are very hard for present-day models but intuitive for humans. All the examples were multiply-labeled by separate annotators."
+    "DynaSent Round 2 was created using different methods than Round 1. For Round 2, crowdworkers edited sentences from the Yelp Academic Dataset seeking to achieve a particular sentiment goal (e.g., expressing a positive sentiment) while fooling a top-performing model. This work was done on the [Dynabench](https://dynabench.org) platform. The hope is that this directly adversarial goal will lead to examples that are very hard for present-day models but intuitive for humans. All the examples were multiply-labeled by separate annotators."
   ]
  },
  {
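A quick usage sketch to accompany the hunks above. The Hugging Face dataset id/config and the loop's final `print` line are assumptions (the corresponding notebook cells fall outside this diff), so treat this as illustrative only:

```python
from collections import Counter
from datasets import load_dataset

# Assumed HF id/config for DynaSent Round 1; recent versions of
# `datasets` may also require trust_remote_code=True here.
dynasent_r1 = load_dataset("dynabench/dynasent", "dynabench.dynasent.r1.all")

def print_label_dist(dataset, labelname='gold_label', splitnames=('train', 'validation')):
    for splitname in splitnames:
        print(splitname)
        dist = sorted(Counter(dataset[splitname][labelname]).items())
        for k, v in dist:
            # Hypothetical final line; the hunk above cuts off
            # before the loop body ends.
            print(f"\t{k:>14s}: {v}")

print_label_dist(dynasent_r1)
```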
@@ -434,7 +428,7 @@
    "id": "qeONNIJQJ8hP"
   },
   "source": [
-    "The [Stanford Sentiment Treebank (SST)](http://nlp.stanford.edu/sentiment/) of [Socher et al. 2013](https://aclanthology.org/D13-1170/) is a widely-used resource for evaluating supervised NLU models. It consists of sentences from Rotten Tomatoes Movie Reviews. We will use the ternary version of the task (SST-3)."
+    "The [Stanford Sentiment Treebank (SST)](http://nlp.stanford.edu/sentiment/) of [Socher et al. 2013](https://aclanthology.org/D13-1170/) is a widely-used resource for evaluating supervised models. It consists of sentences from Rotten Tomatoes Movie Reviews (see [Pang and Lee's project page](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.home.html)). We will use the ternary version of the task (SST-3)."
   ]
  },
  {
@@ -696,19 +690,19 @@
    "outputs": [],
    "source": [
     "def unigrams_phi(s):\n",
-    "    \"\"\"The basis for a bigrams feature function. \n",
-    "    \n",
+    "    \"\"\"The basis for a unigrams feature function.\n",
+    "\n",
     "    Downcases all tokens.\n",
     "\n",
     "    Parameters\n",
     "    ----------\n",
-    "    text : str\n",
+    "    s : str\n",
     "        The example to represent\n",
     "\n",
     "    Returns\n",
     "    -------\n",
     "    Counter\n",
-    "        A map from tuples to their counts in `text`\n",
+    "        A map from tokens (str) to their counts in `s`\n",
     "\n",
     "    \"\"\"\n",
     "    return Counter(s.lower().split())"
@@ -727,7 +721,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "unigrams_phi(\"Here's an example with an emoticon :)\")"
+    "unigrams_phi(\"Here's an example with an emoticon :)!\")"
   ]
  },
  {
@@ -1113,7 +1107,7 @@
  },
  {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1122,13 +1116,13 @@
     "from nltk.tokenize import TweetTokenizer\n",
     "\n",
     "def tweetgrams_phi(s, **kwargs):\n",
     "    \"\"\"The basis for a feature function using `TweetTokenizer`.\n",
-    "    \n",
+    "\n",
     "    Parameters\n",
     "    ----------\n",
     "    s : str\n",
     "    kwargs : dict\n",
     "        Passed to `TweetTokenizer`\n",
-    "    \n",
+    "\n",
     "    Returns\n",
     "    -------\n",
     "    Counter\n",
@@ -1150,7 +1144,7 @@
  },
  {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1185,17 +1179,9 @@
  },
  {
    "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "All tests passed for `tweetgrams_phi`\n"
-     ]
-    }
-   ],
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
    "source": [
     "test_tweetgrams_phi(tweetgrams_phi)"
   ]
  },
  {
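The hunks above touch only `tweetgrams_phi`'s docstring; its body lies outside the diff. A minimal sketch of a completion consistent with that docstring (an assumption, not necessarily the notebook's reference solution):

```python
from collections import Counter
from nltk.tokenize import TweetTokenizer

def tweetgrams_phi(s, **kwargs):
    # `TweetTokenizer` treats emoticons like ":)" as single tokens,
    # unlike the plain whitespace splitting in `unigrams_phi`.
    tokenizer = TweetTokenizer(**kwargs)
    return Counter(tokenizer.tokenize(s))

tweetgrams_phi("Here's an example with an emoticon :)!", preserve_case=False)
# Counter({'an': 2, "here's": 1, 'example': 1, 'with': 1,
#          'emoticon': 1, ':)': 1, '!': 1})
```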
@@ -1239,16 +1225,16 @@
    "source": [
     "def train_linear_model(model, featfunc, train_dataset):\n",
     "    \"\"\"Train an sklearn classifier.\n",
-    "    \n",
+    "\n",
     "    Parameters\n",
     "    ----------\n",
     "    model : sklearn classifier model\n",
     "    featfunc : func\n",
-    "        Maps strings to Counter instances.\n",
+    "        Maps strings to Counter instances\n",
     "    train_dataset: dict\n",
     "        Must have a key \"sentence\" containing strings that `featfunc` \n",
-    "        will process, and a key \"gold_label\" giving labels.\n",
-    "    \n",
+    "        will process, and a key \"gold_label\" giving labels\n",
+    "\n",
     "    Returns\n",
     "    -------\n",
     "    tuple\n",
@@ -1258,20 +1244,21 @@
     "    \"\"\"\n",
     "    pass\n",
     "    # Step 1: Featurize all the examples in `train_dataset['sentence']`\n",
-    "    ##### YOUR CODE HERE \n",
+    "    ##### YOUR CODE HERE\n",
+    "\n",
+    "\n",
     "\n",
-    "    \n",
     "    # Step 2: Instantiate and use a `DictVectorizer`:\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
     "\n",
-    "    \n",
+    "\n",
     "    # Step 3: Train the model on the feature matrix and\n",
     "    #         train_dataset['gold_label']:\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
     "\n",
-    "    \n",
+    "\n",
     "    # Step 4: Return (model, vectorizer):\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
@@ -1300,7 +1287,7 @@
     "    model = LogisticRegression()\n",
     "    result = func(model, featfunc, train_dataset)\n",
     "    if not isinstance(result, tuple) or len(result) != 2:\n",
-    "        print(f\"Error for `{func.__name__}` incorrect return type\")\n",
+    "        print(f\"Error for `{func.__name__}`: Incorrect return type\")\n",
     "        return\n",
     "    model, vectorizer = result\n",
     "    if not hasattr(vectorizer, 'vocabulary_'):\n",
@@ -1310,7 +1297,7 @@
     "    if not hasattr(model, 'classes_'):\n",
     "        print(f\"Error for `{func.__name__}`: \"\n",
     "              f\"First return value is not a trained classifier\")\n",
-    "        return 1\n",
+    "        return\n",
     "    print(f\"No errors found for `{func.__name__}`\")"
   ]
  },
  {
@@ -1327,7 +1314,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can now very easily train models on our datasets. Quick example (this shouldn't take more than a couple of minutes to run):"
+    "You can now very easily train models on our datasets. Quick example (this shouldn't take more than a couple of minutes to run even on a CPU):"
   ]
  },
  {
@@ -1367,37 +1354,38 @@
    "source": [
     "def assess_linear_model(model, featfunc, vectorizer, assess_dataset):\n",
     "    \"\"\"Assess a trained sklearn model.\n",
-    "    \n",
+    "\n",
     "    Parameters\n",
     "    ----------\n",
     "    model: trained sklearn model\n",
     "    featfunc : func\n",
-    "        Maps strings to count dicts.\n",
+    "        Maps strings to count dicts\n",
     "    vectorizer : fitted DictVectorizer\n",
     "    assess_dataset: dict\n",
     "        Must have a key \"sentence\" containing strings that `featfunc` \n",
-    "        will process, and a key \"gold_label\" giving labels.\n",
-    "    \n",
+    "        will process, and a key \"gold_label\" giving labels\n",
+    "\n",
     "    Returns\n",
     "    -------\n",
     "    A classification report (multiline string)\n",
-    "    \n",
+    "\n",
     "    \"\"\"\n",
     "    pass\n",
     "    # Step 1: Featurize the assessment data:\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
-    "    \n",
+    "\n",
+    "\n",
     "    # Step 2: Vectorize the assessment data features:\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
     "\n",
-    "    \n",
+    "\n",
     "    # Step 3: Make predictions:\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
     "\n",
-    "    \n",
+    "\n",
     "    # Step 4: Return a classification report (str):\n",
     "    ##### YOUR CODE HERE\n",
     "\n",
@@ -1428,16 +1416,16 @@
     "        return Counter(s.split())\n",
     "    model = LogisticRegression()\n",
     "    model, vectorizer = trainfunc(model, featfunc, train_dataset)\n",
-    "    result = assessfunc(model, featfunc, vectorizer, assess_dataset) \n",
+    "    result = assessfunc(model, featfunc, vectorizer, assess_dataset)\n",
     "    errcount = 0\n",
     "    if len(vectorizer.vocabulary_) != 2:\n",
-    "        print(\"Error for `{assessfunc.__name__}`: Unexpected feature count\")\n",
+    "        print(f\"Error for `{assessfunc.__name__}`: Unexpected feature count\")\n",
     "        errcount += 1\n",
     "    if 'weighted avg' not in result:\n",
-    "        print(\"Error for `{assessfunc.__name__}`: Unexpected return value\")\n",
+    "        print(f\"Error for `{assessfunc.__name__}`: Unexpected return value\")\n",
     "        errcount += 1\n",
     "    if errcount == 0:\n",
-    "        print(f\"No errors found for `{assessfunc.__name__}`\") "
+    "        print(f\"No errors found for `{assessfunc.__name__}`\")"
   ]
  },
  {
@@ -1469,9 +1457,9 @@
    "outputs": [],
    "source": [
     "report = assess_linear_model(\n",
-    "    lr_unigrams, \n",
-    "    unigrams_phi, \n",
-    "    vec_unigrams, \n",
+    "    lr_unigrams,\n",
+    "    unigrams_phi,\n",
+    "    vec_unigrams,\n",
     "    dynasent_r1['validation'])\n",
     "\n",
     "print(report)"
   ]
  },
  {
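Since `train_linear_model` and `assess_linear_model` above are scaffolded exercises, here is one reasonable completion matching their docstrings, offered as a sketch rather than the official solution:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def train_linear_model(model, featfunc, train_dataset):
    # Step 1: featurize; Step 2: vectorize; Step 3: fit; Step 4: return.
    feats = [featfunc(s) for s in train_dataset['sentence']]
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feats)
    model.fit(X, train_dataset['gold_label'])
    return model, vectorizer

def assess_linear_model(model, featfunc, vectorizer, assess_dataset):
    feats = [featfunc(s) for s in assess_dataset['sentence']]
    # Reuse the fitted vectorizer so train/assess feature spaces align:
    X = vectorizer.transform(feats)
    preds = model.predict(X)
    return classification_report(assess_dataset['gold_label'], preds, digits=3)
```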
"metadata": {}, "source": [ - "We'll use BERT-mini for the homework so that we can rapdily develop prototypes. You can then consider scaling up to larger models." + "We'll use BERT-mini (originally from [the BERT repo](https://github.com/google-research/bert)) for the homework so that we can rapdily develop prototypes. You can then consider scaling up to larger models." ] }, { @@ -1803,7 +1791,7 @@ "def get_batch_token_ids(batch, tokenizer):\n", " \"\"\"Map `batch` to a tensor of ids. The return\n", " value should meet the following specification:\n", - " \n", + "\n", " 1. The max length should be 512.\n", " 2. Examples longer than the max length should be truncated\n", " 3. Examples should be padded to the max length for the batch.\n", @@ -1811,20 +1799,21 @@ " token [SEP] should be added to the end.\n", " 5. The attention mask should be returned\n", " 6. The return value of each component should be a tensor. \n", - " \n", + "\n", " Parameters\n", " ----------\n", " batch: list of str\n", " tokenizer: Hugging Face tokenizer\n", - " \n", + "\n", " Returns\n", " -------\n", " dict with at least \"input_ids\" and \"attention_mask\" as keys,\n", " each with Tensor values\n", - " \n", + "\n", " \"\"\"\n", " pass\n", - " ##### YOUR CODE HERE \n", + " ##### YOUR CODE HERE\n", + "\n", "\n" ] }, @@ -1843,7 +1832,7 @@ "source": [ "def test_get_batch_token_ids(func):\n", " examples = [\n", - " \"Bert knows Snuffleupagus\", \n", + " \"Bert knows Snuffleupagus\",\n", " \"ELMo knew Bert.\",\n", " \"Buffalo \" * 520\n", " ]\n", @@ -1868,7 +1857,7 @@ " print(f\"Error for `{func.__name__}`: \"\n", " f\"Special tokens were not added\")\n", " if errcount == 0:\n", - " print(f\"No errors found for `{func.__name__}`\") " + " print(f\"No errors found for `{func.__name__}`\")" ] }, { @@ -1908,40 +1897,41 @@ "outputs": [], "source": [ "def get_reps(dataset, model, tokenizer, batchsize=20):\n", - " \"\"\"Represent each example in `dataset` with the \n", - " final hidden state above the [CLS] token.\n", - " \n", + " \"\"\"Represent each example in `dataset` with the final hidden state \n", + " above the [CLS] token.\n", + "\n", " Parameters\n", " ----------\n", " dataset : list of str\n", " model : BertModel\n", " tokenizer : BertTokenizerFast\n", " batchsize : int\n", - " \n", + "\n", " Returns\n", " -------\n", - " torch.Tensor with shape `(n_examples, dim)` where `dim` is the \n", - " dimensionality of the representations for `model`. \n", - " \n", - " \"\"\" \n", + " torch.Tensor with shape `(n_examples, dim)` where `dim` is the\n", + " dimensionality of the representations for `model`\n", + "\n", + " \"\"\"\n", " data = []\n", " with torch.no_grad():\n", + " pass\n", " # Iterate over `dataset` in batches:\n", " ##### YOUR CODE HERE\n", - " pass\n", "\n", - " \n", + "\n", + "\n", " # Encode the batch with `get_batch_token_ids`:\n", " ##### YOUR CODE HERE\n", "\n", "\n", - " \n", + "\n", " # Get the representations from the model, making\n", " # sure to pay attention to masking:\n", " ##### YOUR CODE HERE\n", "\n", "\n", - " \n", + "\n", " # Return a single tensor:\n", " ##### YOUR CODE HERE\n", "\n", @@ -1974,7 +1964,7 @@ " if round(result[0][0].item(), 2) != -0.64:\n", " print(f\"Error for `{func.__name__}`: \"\n", " f\"Representations seem to be incorrect\")\n", - " print(f\"No errors found for `{func.__name__}`\") " + " print(f\"No errors found for `{func.__name__}`\")" ] }, { @@ -2026,7 +2016,7 @@ " layer on top of that as the final output. 
@@ -2026,7 +2016,7 @@
     "        layer on top of that as the final output. The output of\n",
     "        the dense layer should have the same dimensionality as the\n",
     "        model input.\n",
-    "        \n",
+    "\n",
     "        Parameters\n",
     "        ----------\n",
     "        n_classes : int\n",
@@ -2036,7 +2026,7 @@
     "        weights_name : str\n",
     "            Name of pretrained model to load from Hugging Face\n",
     "\n",
-    "        \"\"\" \n",
+    "        \"\"\"\n",
     "        super().__init__()\n",
     "        self.n_classes = n_classes\n",
     "        self.weights_name = weights_name\n",
@@ -2056,34 +2046,34 @@
     "        # and we rely on the PyTorch loss function to apply a\n",
     "        # softmax to y. \n",
     "        self.classifier_layer = None\n",
-    "        ##### YOUR CODE HERE \n",
+    "        ##### YOUR CODE HERE\n",
+    "\n",
     "\n",
     "\n",
-    "    \n",
     "    def forward(self, indices, mask):\n",
     "        \"\"\"Process `indices` with `mask` by feeding these arguments\n",
     "        to `self.bert` and then feeding the initial hidden state\n",
     "        in `last_hidden_state` to `self.classifier_layer`.\n",
-    "        \n",
+    "\n",
     "        Parameters\n",
     "        ----------\n",
     "        indices : tensor.LongTensor of shape (n_batch, k)\n",
-    "            Indices into the `self.bert` embedding layer. `n_batch` is \n",
-    "            the number of examples and `k` is the sequence length for \n",
+    "            Indices into the `self.bert` embedding layer. `n_batch` is\n",
+    "            the number of examples and `k` is the sequence length for\n",
     "            this batch\n",
     "        mask : tensor.LongTensor of shape (n_batch, k)\n",
-    "            Binary vector indicating which values should be masked. \n",
-    "            `n_batch` is the number of examples and `k` is the \n",
+    "            Binary vector indicating which values should be masked.\n",
+    "            `n_batch` is the number of examples and `k` is the\n",
     "            sequence length for this batch\n",
-    "        \n",
+    "\n",
     "        Returns\n",
     "        -------\n",
     "        tensor.FloatTensor\n",
     "            Predicted values, shape `(n_batch, self.n_classes)`\n",
-    "        \n",
+    "\n",
     "        \"\"\"\n",
     "        pass\n",
-    "        ##### YOUR CODE HERE \n",
+    "        ##### YOUR CODE HERE\n",
     "\n",
     "\n"
   ]
  },
  {
@@ -2104,7 +2094,7 @@
    "outputs": [],
    "source": [
     "ids = get_batch_token_ids(\n",
-    "    dynasent_r1['train']['sentence'][: 2], \n",
+    "    dynasent_r1['train']['sentence'][: 2],\n",
     "    bert_tokenizer)\n",
     "\n",
     "bert_module(ids['input_ids'], ids['attention_mask'])"
   ]
  },
  {
@@ -2122,12 +2112,12 @@
     "    expected_activation = nn.ReLU()\n",
     "    mod = moduleclass(expected_out, expected_activation)\n",
     "    errcount = 0\n",
-    "    \n",
+    "\n",
     "    # Basic layer structure:\n",
     "    if not hasattr(mod, \"classifier_layer\") or mod.classifier_layer is None:\n",
     "        errcount += 1\n",
     "        print(f\"Error for `{moduleclass.__name__}`: \"\n",
-    "              f\"Missing attribute `classifier_layer`\") \n",
+    "              f\"Missing attribute `classifier_layer`\")\n",
     "        return \n",
     "    for i in range(3):\n",
     "        try:\n",
@@ -2136,7 +2126,7 @@
     "            errcount += 1\n",
     "            print(f\"Error for `{moduleclass.__name__}`: \"\n",
     "                  f\"`classifier_layer` is not an `nn.Sequential` \"\n",
-    "                  f\"and/or does not have the right structure\") \n",
+    "                  f\"and/or does not have the right structure\")\n",
     "    # Correct first layer dimensionality:\n",
     "    result_hidden = mod.classifier_layer[0].out_features\n",
     "    if result_hidden != expected_hidden:\n",
@@ -2211,7 +2201,7 @@
     "    def build_graph(self):\n",
     "        return BertClassifierModule(\n",
     "            self.n_classes_, self.hidden_activation, self.weights_name)\n",
-    "    \n",
+    "\n",
     "    def build_dataset(self, X, y=None):\n",
     "        data = get_batch_token_ids(X, self.tokenizer)\n",
     "        if y is None:\n",
@@ -2270,7 +2260,7 @@
     "%%time\n",
     "\n",
     "_ = bert_finetune.fit(\n",
-    "    dynasent_r1['train']['sentence'], \n",
+    "    dynasent_r1['train']['sentence'],\n",
     "    dynasent_r1['train']['gold_label'])"
   ]
  },
  {
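For the classifier head and `forward` left blank above, here is a sketch consistent with the docstrings and the structure test; the `weights_name` default and the `config.hidden_size` lookup are assumptions, since the notebook's own `__init__` body is only partially visible in this diff:

```python
import torch.nn as nn
from transformers import BertModel

class BertClassifierModule(nn.Module):
    def __init__(self, n_classes, hidden_activation,
                 weights_name="prajjwal1/bert-mini"):
        super().__init__()
        self.n_classes = n_classes
        self.weights_name = weights_name
        self.bert = BertModel.from_pretrained(self.weights_name)
        self.hidden_dim = self.bert.config.hidden_size
        # Dense layer, activation, dense output layer; no softmax,
        # since the PyTorch loss function applies one internally:
        self.classifier_layer = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes))

    def forward(self, indices, mask):
        reps = self.bert(indices, attention_mask=mask)
        # Initial hidden state (the [CLS] position) feeds the head:
        return self.classifier_layer(reps.last_hidden_state[:, 0, :])
```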
@@ -2390,7 +2380,9 @@
    "source": [
     "The bakeoff dataset is available at \n",
     "\n",
-    "https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv"
+    "https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv\n",
+    "\n",
+    "This code should grab it for you and put it in `data/sentiment` if you are working in the cloud:"
   ]
  },
  {
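The download cell the new text refers to is not part of this diff; a hypothetical stand-in that fetches the CSV into `data/sentiment` might look like this:

```python
import os
import pandas as pd

SENTIMENT_HOME = os.path.join('data', 'sentiment')
os.makedirs(SENTIMENT_HOME, exist_ok=True)

url = "https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv"

# pandas can read directly from a URL; save a local copy for reuse:
bakeoff_df = pd.read_csv(url)
bakeoff_df.to_csv(
    os.path.join(SENTIMENT_HOME, "cs224u-sentiment-test-unlabeled.csv"),
    index=False)
```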