Initial commit
lemonhu committed Dec 19, 2018
1 parent 273b7f7 commit b231bbc
Showing 25 changed files with 113,739 additions and 0 deletions.
104 changes: 104 additions & 0 deletions README.md
# Convolutional Neural Network for Relation Extraction

PyTorch implementation of a convolutional neural network approach to the relation extraction challenge ([**SemEval-2010 Task #8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals**](https://docs.google.com/document/d/1QO_CnmvNRnYwNWu1-QCAeR5ToQYkXUqFeAJbdEhsq7w/preview)).

![Architecture](./img/Architecture.jpeg)
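The sketch below illustrates this kind of architecture in PyTorch: word embeddings are concatenated with two position-embedding channels (each token's distance to the two nominals), passed through a 1-D convolution, max-pooled over time, and fed to a linear classifier. It is only a hedged outline; the class name, layer sizes, and hyperparameters are illustrative assumptions rather than the exact configuration of this repository's model code.

```python
import torch
import torch.nn as nn


class CNNRelationClassifier(nn.Module):
    """Illustrative CNN for relation classification: word + position embeddings,
    a 1-D convolution, max-pooling over time, and a linear classifier."""

    def __init__(self, vocab_size, num_labels, word_dim=50, pos_dim=5,
                 max_len=100, num_filters=100, window=3, dropout=0.5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Two position channels: relative distance of each token to e1 and to e2,
        # shifted into [0, 2 * max_len) so they can index an embedding table.
        self.pos1_emb = nn.Embedding(2 * max_len, pos_dim)
        self.pos2_emb = nn.Embedding(2 * max_len, pos_dim)
        feature_dim = word_dim + 2 * pos_dim
        self.conv = nn.Conv1d(feature_dim, num_filters, kernel_size=window,
                              padding=window // 2)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters, num_labels)

    def forward(self, tokens, pos1, pos2):
        # tokens, pos1, pos2: (batch, seq_len) LongTensors of indices.
        x = torch.cat([self.word_emb(tokens),
                       self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)   # (batch, seq_len, feature_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, num_filters, seq_len)
        x = x.max(dim=-1).values                       # max-pool over time
        return self.fc(self.dropout(x))                # unnormalized class scores
```

Position features of this kind follow Zeng et al. (2014), listed in the references below.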

## Requirements

We recommend using Python 3 and a conda environment.

```shell
source activate your_env
pip install -r requirements.txt
```

## Data: SemEval-2010 Task #8

- Given: a sentence marked with a pair of *nominals*
- Goal: recognize the semantic relation between these nominals.
- Example (the raw file layout is sketched after this list):
  - "There were apples, <e1>**pears**</e1> and oranges in the <e2>**bowl**</e2>."
    => *Content-Container(e1,e2)*
  - "The cup contained <e1>**tea**</e1> from dried <e2>**ginseng**</e2>."
    => *Entity-Origin(e1,e2)*
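For reference, each raw entry in the SemEval files pairs a numbered, quoted sentence containing the `<e1>`/`<e2>` tags with its relation label and a comment line. Reusing the first example above, an entry looks roughly like:

```
1	"There were apples, <e1>pears</e1> and oranges in the <e2>bowl</e2>."
Content-Container(e1,e2)
Comment:
```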

### The Inventory of Semantic Relations

1. *Cause-Effect*: An event or object leads to an effect (those cancers were caused by radiation exposures)
2. *Instrument-Agency*: An agent uses an instrument (phone operator)
3. *Product-Producer*: A producer causes a product to exist (a factory manufactures suits)
4. *Content-Container*: An object is physically stored in a delineated area of space (a bottle full of honey was weighed)
5. *Entity-Origin*: An entity is coming or is derived from an origin, e.g., position or material (letters from foreign countries)
6. *Entity-Destination*: An entity is moving towards a destination (the boy went to bed)
7. *Component-Whole*: An object is a component of a larger whole (my apartment has a large kitchen)
8. *Member-Collection*: A member forms a nonfunctional part of a collection (there are many trees in the forest)
9. *Message-Topic*: An act of communication, written or spoken, is about a topic (the lecture was about semantics)
10. *Other*: If none of the above nine relations appears to be suitable.

### Distribution for Dataset

| Relation | Train Data | Test Data | Total Data |
| :----------------: | :-----------------: | :-----------------: | :------------------: |
| Cause-Effect | 1,003 (12.54%) | 328 (12.07%) | 1,331 (12.42%) |
| Instrument-Agency | 504 (6.30%) | 156 (5.74%) | 660 (6.16%) |
| Product-Producer | 717 (8.96%) | 231 (8.50%) | 948 (8.85%) |
| Content-Container | 540 (6.75%) | 192 (7.07%) | 732 (6.83%) |
| Entity-Origin | 716 (8.95%) | 258 (9.50%) | 974 (9.09%) |
| Entity-Destination | 845 (10.56%) | 292 (10.75%) | 1,137 (10.61%) |
| Component-Whole | 941 (11.76%) | 312 (11.48%) | 1,253 (11.69%) |
| Member-Collection | 690 (8.63%) | 233 (8.58%) | 923 (8.61%) |
| Message-Topic | 634 (7.92%) | 261 (9.61%) | 895 (8.35%) |
| Other | 1,410 (17.63%) | 454 (16.71%) | 1,864 (17.39%) |
| **Total** | **8,000 (100.00%)** | **2,717 (100.00%)** | **10,717 (100.00%)** |

## Quickstart

- Train data is located in "*data/SemEval2010_task8/TRAIN_FILE.TXT*".
- `Vector_50d.txt` is used as the pre-trained word-embedding (word2vec) file; a sketch of loading it is shown after this list.
- We use the micro-averaged F-score over the 18 relation labels (i.e., excluding *Other*) as our evaluation criterion.
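A minimal sketch of how the pre-trained vectors might be loaded into an embedding matrix, assuming `Vector_50d.txt` stores one word followed by its 50 floating-point values per line; the function name and the random-initialization range are illustrative, not the repository's own code:

```python
import numpy as np


def load_word_vectors(path, vocab, dim=50):
    """Build an embedding matrix aligned with `vocab` (a word -> index dict),
    assuming each line of the vector file is: word v1 v2 ... v50."""
    # Words missing from the pre-trained file keep a small random initialization.
    embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip headers or malformed lines
            word, values = parts[0], parts[1:]
            if word in vocab:
                embeddings[vocab[word]] = np.asarray(values, dtype=np.float32)
    return embeddings
```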

1. **Build** vocabularies and parameters for your dataset by running

```shell
python build_vocab.py --data_dir data/SemEval2010_task8
```

It will write vocabulary files `words.txt` and `labels.txt` containing the words and labels in the dataset. It will also save a `dataset_params.json` with some extra information.

2. **Your experiment.** We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like:

```json
{
"learning_rate": 1e-3,
"batch_size": 50,
"num_epochs": 100
}
```

For every new experiment, you will need to create a new directory under `experiments` with a `params.json` file.
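At runtime these hyperparameters are presumably read back from `params.json`; a minimal sketch of that pattern is shown below. The `Params` helper is a common convention in this style of project and is assumed here, not necessarily the repository's own utility:

```python
import json


class Params:
    """Thin wrapper exposing the keys of params.json as attributes."""

    def __init__(self, json_path):
        with open(json_path) as f:
            self.__dict__.update(json.load(f))


params = Params('experiments/base_model/params.json')
print(params.learning_rate, params.batch_size, params.num_epochs)
```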

3. **Train** your experiment. Simply run

```shell
python train.py --data_dir data/SemEval2010_task8 --model_dir experiments/base_model
```

It will instantiate a model and train it on the training set following the hyperparameters specified in `params.json`. It will also evaluate some metrics on the development set.
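Schematically, training reduces to the usual PyTorch loop driven by those hyperparameters. The fragment below is a self-contained, hypothetical outline on random tensors, reusing the `CNNRelationClassifier` sketch from above; the repository's `train.py` will differ in structure and naming.

```python
import torch
import torch.nn as nn

# Hypothetical outline of a few optimization steps on random tensors,
# reusing the CNNRelationClassifier sketch defined earlier in this README.
model = CNNRelationClassifier(vocab_size=1000, num_labels=19)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # params.learning_rate in a real run
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (50, 100))   # a fake batch: batch_size=50, seq_len=100
pos1 = torch.randint(0, 200, (50, 100))      # relative positions to e1 (already shifted)
pos2 = torch.randint(0, 200, (50, 100))      # relative positions to e2
labels = torch.randint(0, 19, (50,))

for epoch in range(2):                       # params.num_epochs in a real run
    optimizer.zero_grad()
    loss = criterion(model(tokens, pos1, pos2), labels)
    loss.backward()
    optimizer.step()
    print("epoch {}: loss {:.4f}".format(epoch, loss.item()))
```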

4. **Evaluation on the test set** Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run

```shell
python evaluate.py --data_dir data/SemEval2010_task8 --model_dir experiments/base_model
```

## Results

| Precision | Recall | F1 |
| :-------: | :----: | :---: |
| 77.74 | 84.79 | 81.11 |

## References

- **Relation Classification via Convolutional Deep Neural Network** (COLING 2014), D Zeng et al. [[paper]](http://www.aclweb.org/anthology/C14-1220)
- **Relation Extraction: Perspective from Convolutional Neural Networks** (NAACL 2015), TH Nguyen et al. [[paper]](http://www.cs.nyu.edu/~thien/pubs/vector15.pdf)
72 changes: 72 additions & 0 deletions build_semeval_dataset.py
"""Read and save the semeval dataset for our model"""

import os
import re

# Regexes: strip the <e1>/<e2> tags and possessive 's, capture the two tagged
# nominals, and match a leading or trailing punctuation symbol on a token.
pattern_repl = re.compile('(<e1>)|(</e1>)|(<e2>)|(</e2>)|(\'s)')
pattern_e1 = re.compile('<e1>(.*)</e1>')
pattern_e2 = re.compile('<e2>(.*)</e2>')
pattern_symbol = re.compile('^[!"#$%&\\\'()*+,-./:;<=>?@[\\]^_`{|}~]|[!"#$%&\\\'()*+,-./:;<=>?@[\\]^_`{|}~]$')


def load_dataset(path_dataset):
    """Load dataset into memory from text file"""
    dataset = []
    with open(path_dataset) as f:
        piece = list()  # a piece of data
        for line in f:
            line = line.strip()
            if line:
                piece.append(line)
            elif piece:
                sentence = piece[0].split('\t')[1].strip('"')
                e1 = delete_symbol(pattern_e1.findall(sentence)[0])
                e2 = delete_symbol(pattern_e2.findall(sentence)[0])
                new_sentence = list()
                for word in pattern_repl.sub('', sentence).split(' '):
                    new_word = delete_symbol(word)
                    if new_word:
                        new_sentence.append(new_word)

                relation = piece[1]
                dataset.append(((e1, e2, ' '.join(new_sentence)), relation))
                piece = list()
    return dataset


def delete_symbol(text):
    """Strip a leading or trailing punctuation symbol from a token."""
    if pattern_symbol.search(text):
        return pattern_symbol.sub('', text)
    return text


def save_dataset(dataset, save_dir):
    """Write `sentences.txt` and `labels.txt` files in save_dir from dataset"""
    # Create directory if it doesn't exist
    print("Saving in {}...".format(save_dir))
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Export the dataset
    with open(os.path.join(save_dir, 'sentences.txt'), 'w') as file_sentences, \
         open(os.path.join(save_dir, 'labels.txt'), 'w') as file_labels:
        for words, labels in dataset:
            file_sentences.write('{}\n'.format('\t'.join(words)))
            file_labels.write('{}\n'.format(labels))
    print("- done.")


if __name__ == '__main__':
    path_train = 'data/SemEval2010_task8/TRAIN_FILE.TXT'
    path_test = 'data/SemEval2010_task8/TEST_FILE.TXT'
    msg = "{} or {} file not found. Make sure you have downloaded the right dataset".format(path_train, path_test)
    assert os.path.isfile(path_train) and os.path.isfile(path_test), msg

    # load the dataset into memory
    print("Loading SemEval2010_task8 dataset into memory...")
    train_dataset = load_dataset(path_train)
    test_dataset = load_dataset(path_test)
    print("- done.")

    # save the dataset to text file
    save_dataset(train_dataset, 'data/SemEval2010_task8/train')
    save_dataset(test_dataset, 'data/SemEval2010_task8/test')
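With this script, each line of the generated `sentences.txt` carries the two nominals and the cleaned sentence separated by tabs, and the matching line of `labels.txt` carries the relation. For the bowl example from the README, the output would look roughly like:

```
# sentences.txt (tab-separated: e1, e2, cleaned sentence)
pears	bowl	There were apples pears and oranges in the bowl

# labels.txt (one relation per line, aligned with sentences.txt)
Content-Container(e1,e2)
```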

100 changes: 100 additions & 0 deletions build_vocab.py
"""Build vocabularies of words and labels from datasets"""

import argparse
from collections import Counter
import json
import os


parser = argparse.ArgumentParser()
parser.add_argument('--min_count_word', default=1, help="Minimum count for words in the dataset", type=int)
parser.add_argument('--min_count_tag', default=1, help="Minimum count for labels in the dataset", type=int)
parser.add_argument('--data_dir', default='data/SemEval2010_task8', help="Directory containing the dataset")


def save_to_txt(vocab, txt_path):
    """Writes one token per line, 0-based line id corresponds to the id of the token.
    Args:
        vocab: (iterable object) yields token
        txt_path: (string) path to vocab file
    """
    with open(txt_path, 'w') as f:
        for token in vocab:
            f.write(token + '\n')


def save_dict_to_json(d, json_path):
    """Saves dict to json file
    Args:
        d: (dict)
        json_path: (string) path to json file
    """
    with open(json_path, 'w') as f:
        d = {k: v for k, v in d.items()}
        json.dump(d, f, indent=4)


def update_vocab(txt_path, vocab):
    """Update word vocabulary from dataset and return the number of lines read"""
    with open(txt_path) as f:
        for i, line in enumerate(f):
            line = line.strip()
            if line.endswith('...'):
                line = line.rstrip('...')
            word_seq = line.split('\t')[-1].split(' ')
            vocab.update(word_seq)
    return i + 1


def update_labels(txt_path, labels):
    """Update label vocabulary from dataset and return the number of lines read"""
    with open(txt_path) as f:
        for i, line in enumerate(f):
            line = line.strip()  # one label per line
            labels.update([line])
    return i + 1


if __name__ == '__main__':
    args = parser.parse_args()

    # Build word vocab with train and test datasets
    print("Building word vocabulary...")
    words = Counter()
    size_train_sentences = update_vocab(os.path.join(args.data_dir, 'train/sentences.txt'), words)
    size_test_sentences = update_vocab(os.path.join(args.data_dir, 'test/sentences.txt'), words)
    print("- done.")

    # Build label vocab with train and test datasets
    print("Building label vocabulary...")
    labels = Counter()
    size_train_tags = update_labels(os.path.join(args.data_dir, 'train/labels.txt'), labels)
    size_test_tags = update_labels(os.path.join(args.data_dir, 'test/labels.txt'), labels)
    print("- done.")

    # Assert same number of examples in datasets
    assert size_train_sentences == size_train_tags
    assert size_test_sentences == size_test_tags

    # Only keep most frequent tokens
    words = sorted([tok for tok, count in words.items() if count >= args.min_count_word])
    labels = sorted([tok for tok, count in labels.items() if count >= args.min_count_tag])

    # Save vocabularies to text file
    print("Saving vocabularies to file...")
    save_to_txt(words, os.path.join(args.data_dir, 'words.txt'))
    save_to_txt(labels, os.path.join(args.data_dir, 'labels.txt'))
    print("- done.")

    # Save datasets properties in json file
    sizes = {
        'train_size': size_train_sentences,
        'test_size': size_test_sentences,
        'vocab_size': len(words),
        'num_tags': len(labels)
    }
    save_dict_to_json(sizes, os.path.join(args.data_dir, 'dataset_params.json'))

    # Logging sizes
    to_print = "\n".join("-- {}: {}".format(k, v) for k, v in sizes.items())
    print("Characteristics of the dataset:\n{}".format(to_print))
