diff --git a/rel_ext_01_task.ipynb b/rel_ext_01_task.ipynb
index bc970b3..4659889 100644
--- a/rel_ext_01_task.ipynb
+++ b/rel_ext_01_task.ipynb
@@ -68,11 +68,11 @@
  "\n",
  "An obvious way to start is to write down a few patterns which express each relation. For example, we can use the pattern \"X is the founder of Y\" to find new instances of the `founders` relation. If we search a large corpus, we may find the phrase \"Elon Musk is the founder of SpaceX\", which we can use as evidence for the relational triple `(founders, SpaceX, Elon_Musk)`.\n",
  "\n",
- "Unfortunately, this approach doesn't get us very far. The central challenge of relation extraction is the fantastic diversity of language, the multitude of possible ways to express a given relation. For example, each of the following sentences expressed the relational triple `(founders, SpaceX, Elon_Musk)`:\n",
+ "Unfortunately, this approach doesn't get us very far. The central challenge of relation extraction is the fantastic diversity of language, the multitude of possible ways to express a given relation. For example, each of the following sentences expresses the relational triple `(founders, SpaceX, Elon_Musk)`:\n",
  "\n",
- "- \"You may also be thinking of _Elon Musk_ (founder of _SpaceX_), who started PayPal.\"\n",
- "- \"Interesting Fact: _Elon Musk_, co-founder of PayPal, went on to establish _SpaceX_, one of the most promising space travel startups in the world.\"\n",
- "- \"If Space Exploration (_SpaceX_), founded by Paypal pioneer _Elon Musk_ succeeds, commercial advocates will gain credibility and more support in Congress.\"\n",
+ "- \"You may also be thinking of *Elon Musk* (founder of *SpaceX*), who started PayPal.\"\n",
+ "- \"Interesting Fact: *Elon Musk*, co-founder of PayPal, went on to establish *SpaceX*, one of the most promising space travel startups in the world.\"\n",
+ "- \"If Space Exploration (*SpaceX*), founded by Paypal pioneer *Elon Musk* succeeds, commercial advocates will gain credibility and more support in Congress.\"\n",
  "\n",
  "The patterns which connect \"Elon Musk\" with \"SpaceX\" in these examples are not ones we could have easily anticipated. To do relation extraction effectively, we need to go beyond hand-built patterns.\n",
  "\n",
@@ -80,7 +80,7 @@
  "\n",
  "Effective relation extraction will require applying machine learning methods. The natural place to start is with supervised learning. This means training an extraction model from a dataset of examples which have been labeled with the target output. Sentences like the three examples above would be annotated with the `founders` relation, but we'd also have sentences which include \"Elon Musk\" and \"SpaceX\" but do not express the `founders` relation, such as:\n",
  "\n",
- "- \"Billionaire entrepreneur _Elon Musk_ announced the latest addition to the _SpaceX_ arsenal: the 'Big F---ing Rocket' (BFR)\".\n",
+ "- \"Billionaire entrepreneur *Elon Musk* announced the latest addition to the *SpaceX* arsenal: the 'Big F---ing Rocket' (BFR)\".\n",
  "\n",
  "Such \"negative examples\" would be labeled as such, and the fully-supervised model would then be able to learn from both positive and negative examples the linguistic patterns that indicate each relation.\n",
  "\n",
@@ -94,7 +94,7 @@
  "\n",
  "The second limitation is that we need an existing KB to start from. We can only train a model to extract new instances of the `founders` relation if we already have many instances of the `founders` relation. Thus, while distant supervision is a great way to extend an existing KB, it's not useful for creating a KB containing new relations from scratch.\n",
  "\n",
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -146,7 +146,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -169,7 +169,7 @@
  "\n",
  "Now, in order to do relation extraction, we actually need _pairs_ of entity mentions, and it's important to have the context around and between the two mentions. Fortunately, UMass has provided an [expanded version of Wikilinks](http://www.iesl.cs.umass.edu/data/data-wiki-links) which includes the context around each entity mention. We've written code to stitch together pairs of entity mentions along with their contexts, and we've filtered the examples extensively. The result is a compact corpus suitable for our purposes.\n",
  "\n",
- "Because we're frequently going to want to retrieve corpus examples containing specific entities, it will be convenient to create a `Corpus` class which holds not only the examples themselves, but also a precomputed index. Let's take a closer look."
+ "Because we're frequently going to want to retrieve corpus examples containing specific entities, we've created a `Corpus` class which holds not only the examples themselves, but also a precomputed index. Let's take a closer look."
  ]
  },
  {
@@ -361,7 +361,7 @@
  "\n",
  "One thing this corpus does _not_ include is any annotation about relations. Thus, it could not be used for the fully-supervised approach to relation extraction, because the fully-supervised approach requires that each pair of entity mentions be annotated with the relation (if any) that holds between the two entities. In order to make any headway, we'll need to connect the corpus with an external source of knowledge about relations. We need a knowledge base.\n",
  "\n",
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -377,7 +377,7 @@
  "source": [
  "The data distribution for this unit includes a _knowledge base_ (KB) ultimately derived from [Freebase](https://en.wikipedia.org/wiki/Freebase). Unfortunately, Freebase was shut down in 2016, but the Freebase data is still available from various sources and in various forms. The KB included here was extracted from the [Freebase Easy data dump](http://freebase-easy.cs.uni-freiburg.de/dump/).\n",
  "\n",
- "The KB is a collection of _relational triples_, each consisting of a _relation_, a _subject_, and an _object_. For example, here are three triples from the KB:\n",
+ "The KB is a collection of *relational triples*, each consisting of a *relation*, a *subject*, and an *object*. For example, here are three triples from the KB:\n",
  "\n",
  "```\n",
  "(place_of_birth, Barack_Obama, Honolulu)\n",
@@ -390,9 +390,7 @@
  "- The relation is one of a handful of predefined constants, such as `place_of_birth` or `has_spouse`.\n",
  "- The subject and object are entities represented by Wiki IDs (that is, suffixes of Wikipedia URLs).\n",
  "\n",
- "Let's write some code to read the KB so that we can take a closer look.\n",
- "\n",
- "Now, just as we did for the corpus, we'll create a `KB` class to store the KB triples and some associated indexes. We'll want to be able to look up KB triples both by relation and by entities, so we'll create indexes for both of those access patterns."
+ "Now, just as we did for the corpus, we've created a `KB` class to store the KB triples and some associated indexes. This class makes it easy and efficient to look up KB triples both by relation and by entities."
  ]
  },
  {
@@ -747,7 +745,7 @@
  "\n",
  "In fact, the whole point of developing methods for automatic relation extraction is to extend existing KBs (and build new ones) by identifying new relational triples from natural language text. If our KBs were complete, we wouldn't have anything to do.\n",
  "\n",
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -759,8 +757,8 @@
  "With our data assets in hand, it's time to provide a precise formulation of the prediction problem we aim to solve. We need to specify:\n",
  "\n",
  "- What is the input to the prediction?\n",
- " - Is it a specific pair of entity _mentions_ in a specific context?\n",
- " - Or is it a pair of _entities_, apart from any specific mentions?\n",
+ " - Is it a specific pair of entity *mentions* in a specific context?\n",
+ " - Or is it a pair of *entities*, apart from any specific mentions?\n",
  "- What is the output of the prediction?\n",
  " - Do we need to predict at most one relation label? (This is [multi-class classification](https://en.wikipedia.org/wiki/Multiclass_classification).)\n",
  " - Or can we predict multiple relation labels? (This is [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification).)\n",
@@ -769,10 +767,12 @@
  "\n",
  "In order to leverage the distant supervision paradigm, we'll need to connect information in the corpus with information in the KB. There are two possibilities, depending on how we formulate our prediction problem:\n",
  "\n",
- "- __Use the KB to generate labels for the corpus.__ If our problem is to classify a pair of entity _mentions_ in a specific example in the corpus, then we can use the KB to provide labels for training examples. Labeling specific examples is how the fully supervised paradigm works, so it's the obvious way to think about leveraging distant supervision as well. Although it can be made to work, it's not actually the preferred approach.\n",
- "- __Use the corpus to generate features for entity pairs.__ If instead our problem is to classify a pair of _entities_, then we can use all the examples from the corpus where those two entities co-occur to generate a feature representation describing the entity pair. This is the approach taken by [Mintz et al. 2009](https://www.aclweb.org/anthology/P09-1113), and it's the approach we'll pursue here.\n",
+ "- __Use the KB to generate labels for the corpus.__ If our problem is to classify a pair of entity *mentions* in a specific example in the corpus, then we can use the KB to provide labels for training examples. Labeling specific examples is how the fully supervised paradigm works, so it's the obvious way to think about leveraging distant supervision as well. Although it can be made to work, it's not actually the preferred approach.\n",
+ "- __Use the corpus to generate features for entity pairs.__ If instead our problem is to classify a pair of *entities*, then we can use all the examples from the corpus where those two entities co-occur to generate a feature representation describing the entity pair. This is the approach taken by [Mintz et al. 2009](https://www.aclweb.org/anthology/P09-1113), and it's the approach we'll pursue here.\n",
+ "\n",
+ "So we'll formulate our prediction problem such that the input is a pair of entities, and the goal is to predict what relation(s) the pair belongs to. The KB will provide the labels, and the corpus will provide the features.\n",
  "\n",
- "So we'll formulate our prediction problem such that the input is a pair of entities, and the goal is to predict what relation(s) the pair belongs to. The KB will provide the labels, and the corpus will provide the features."
+ "We've created a `Dataset` class which combines a corpus and a KB, and provides a variety of convenience methods for the dataset."
  ]
  },
  {
@@ -997,7 +997,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -1179,7 +1179,7 @@
  "\n",
  "Actually, the most remarkable result in this table is the comparatively good performance for the `contains` relation! What does this result tell us about the data?\n",
  "\n",
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  },
  {
@@ -1341,7 +1341,9 @@
  "- Some of the most frequent middles are natural and intuitive. For example, \", son of\" indicates a forward `parents` relation, while \"and his son\" indicates a reverse `parents` relation.\n",
  "- Punctuation and stop words such as \"and\" and \"of\" are extremely common. Unlike some other NLP applications, it's probably a bad idea to throw these away — they carry lots of useful information.\n",
  "- However, punctuation and stop words tend to be highly ambiguous. For example, a bare comma is a likely middle for almost every relation in at least one direction.\n",
- "- A few of the results reflect quirks of the dataset. For example, the appearance of the phrase \"in 1994 , he became a central figure in the\" as a common middle for the `genre` relation reflects both the relative scarcity of examples for that relation, and an unfortunate tendency of the Wikilinks dataset to include duplicate or near-duplicate source documents. (That middle connects the entities [Ready to Die](https://en.wikipedia.org/wiki/Ready_to_Die) — the first studio album by the Notorious B.I.G. — and [East Coast hip hop](https://en.wikipedia.org/wiki/East_Coast_hip_hop).)"
+ "- A few of the results reflect quirks of the dataset. For example, the appearance of the phrase \"in 1994 , he became a central figure in the\" as a common middle for the `genre` relation reflects both the relative scarcity of examples for that relation, and an unfortunate tendency of the Wikilinks dataset to include duplicate or near-duplicate source documents. (That middle connects the entities [Ready to Die](https://en.wikipedia.org/wiki/Ready_to_Die) — the first studio album by the Notorious B.I.G. — and [East Coast hip hop](https://en.wikipedia.org/wiki/East_Coast_hip_hop).)\n",
+ "\n",
+ "Now it's a straightforward task to build and evaluate a classifier which predicts `True` for a candidate `KBTriple` just in case its entities appear in the corpus connected by one of the phrases we just discovered."
  ]
  },
  {
@@ -1428,7 +1430,7 @@
  "\n",
  "__Question:__ What's the optimal value for `top_k`, the number of most frequent middles to consider? What choice maximizes our chosen figure of merit, the macro-averaged F0.5-score?\n",
  "\n",
- "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
+ "\\[ [top](#Relation-extraction-using-distant-supervision:-task-definition) \\]"
  ]
  }
  ],