Initial commit
lemonhu committed Dec 19, 2018
1 parent 273b7f7 commit b231bbc
Showing 25 changed files with 113,739 additions and 0 deletions.
104 changes: 104 additions & 0 deletions README.md
# Convolutional Neural Network for Relation Extraction

PyTorch implementation of a convolutional neural network approach to the relation extraction challenge ([**SemEval-2010 Task #8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals**](https://docs.google.com/document/d/1QO_CnmvNRnYwNWu1-QCAeR5ToQYkXUqFeAJbdEhsq7w/preview)).

![Architecture](./img/Architecture.jpeg)
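The sketch below illustrates this kind of architecture in PyTorch: word embeddings are concatenated with two position-embedding channels (each token's distance to the two nominals), passed through a 1-D convolution, max-pooled over time, and fed to a linear classifier. It is only a hedged outline; the class name, layer sizes, and hyperparameters are illustrative assumptions rather than the exact configuration of this repository's model code.

```python
import torch
import torch.nn as nn


class CNNRelationClassifier(nn.Module):
    """Illustrative CNN for relation classification: word + position embeddings,
    a 1-D convolution, max-pooling over time, and a linear classifier."""

    def __init__(self, vocab_size, num_labels, word_dim=50, pos_dim=5,
                 max_len=100, num_filters=100, window=3, dropout=0.5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Two position channels: relative distance of each token to e1 and to e2,
        # shifted into [0, 2 * max_len) so they can index an embedding table.
        self.pos1_emb = nn.Embedding(2 * max_len, pos_dim)
        self.pos2_emb = nn.Embedding(2 * max_len, pos_dim)
        feature_dim = word_dim + 2 * pos_dim
        self.conv = nn.Conv1d(feature_dim, num_filters, kernel_size=window,
                              padding=window // 2)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters, num_labels)

    def forward(self, tokens, pos1, pos2):
        # tokens, pos1, pos2: (batch, seq_len) LongTensors of indices.
        x = torch.cat([self.word_emb(tokens),
                       self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)   # (batch, seq_len, feature_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, num_filters, seq_len)
        x = x.max(dim=-1).values                       # max-pool over time
        return self.fc(self.dropout(x))                # unnormalized class scores
```

Position features of this kind follow Zeng et al. (2014), listed in the references below.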

## Requirements

We recommend using Python 3 and a conda environment.

```shell
source activate your_env
pip install -r requirements.txt
```

## Data: SemEval-2010 Task #8

- Given: a sentence marked with a pair of *nominals*
- Goal: recognize the semantic relation between these nominals.
- Example (the raw file layout is sketched after this list):
  - "There were apples, <e1>**pears**</e1> and oranges in the <e2>**bowl**</e2>."
    => *Content-Container(e1,e2)*
  - "The cup contained <e1>**tea**</e1> from dried <e2>**ginseng**</e2>."
    => *Entity-Origin(e1,e2)*
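For reference, each raw entry in the SemEval files pairs a numbered, quoted sentence containing the `<e1>`/`<e2>` tags with its relation label and a comment line. Reusing the first example above, an entry looks roughly like:

```
1	"There were apples, <e1>pears</e1> and oranges in the <e2>bowl</e2>."
Content-Container(e1,e2)
Comment:
```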

### The Inventory of Semantic Relations

1. *Cause-Effect*: An event or object leads to an effect (those cancers were caused by radiation exposures)
2. *Instrument-Agency*: An agent uses an instrument (phone operator)
3. *Product-Producer*: A producer causes a product to exist (a factory manufactures suits)
4. *Content-Container*: An object is physically stored in a delineated area of space (a bottle full of honey was weighed)
5. *Entity-Origin*: An entity is coming or is derived from an origin, e.g., position or material (letters from foreign countries)
6. *Entity-Destination*: An entity is moving towards a destination (the boy went to bed)
7. *Component-Whole*: An object is a component of a larger whole (my apartment has a large kitchen)
8. *Member-Collection*: A member forms a nonfunctional part of a collection (there are many trees in the forest)
9. *Message-Topic*: An act of communication, written or spoken, is about a topic (the lecture was about semantics)
10. *Other*: If none of the above nine relations appears to be suitable.

### Distribution for Dataset

| Relation | Train Data | Test Data | Total Data |
| :----------------: | :-----------------: | :-----------------: | :------------------: |
| Cause-Effect | 1,003 (12.54%) | 328 (12.07%) | 1,331 (12.42%) |
| Instrument-Agency | 504 (6.30%) | 156 (5.74%) | 660 (6.16%) |
| Product-Producer | 717 (8.96%) | 231 (8.50%) | 948 (8.85%) |
| Content-Container | 540 (6.75%) | 192 (7.07%) | 732 (6.83%) |
| Entity-Origin | 716 (8.95%) | 258 (9.50%) | 974 (9.09%) |
| Entity-Destination | 845 (10.56%) | 292 (10.75%) | 1,137 (10.61%) |
| Component-Whole | 941 (11.76%) | 312 (11.48%) | 1,253 (11.69%) |
| Member-Collection | 690 (8.63%) | 233 (8.58%) | 923 (8.61%) |
| Message-Topic | 634 (7.92%) | 261 (9.61%) | 895 (8.35%) |
| Other | 1,410 (17.63%) | 454 (16.71%) | 1,864 (17.39%) |
| **Total** | **8,000 (100.00%)** | **2,717 (100.00%)** | **10,717 (100.00%)** |

## Quickstart

- Train data is located in "*data/SemEval2010_task8/TRAIN_FILE.TXT*".
- `Vector_50d.txt` is used as the pre-trained word-embedding (word2vec) file; a sketch of loading it is shown after this list.
- We use the micro-averaged F-score over the 18 relation labels (i.e., excluding *Other*) as our evaluation criterion.
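A minimal sketch of how the pre-trained vectors might be loaded into an embedding matrix, assuming `Vector_50d.txt` stores one word followed by its 50 floating-point values per line; the function name and the random-initialization range are illustrative, not the repository's own code:

```python
import numpy as np


def load_word_vectors(path, vocab, dim=50):
    """Build an embedding matrix aligned with `vocab` (a word -> index dict),
    assuming each line of the vector file is: word v1 v2 ... v50."""
    # Words missing from the pre-trained file keep a small random initialization.
    embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip headers or malformed lines
            word, values = parts[0], parts[1:]
            if word in vocab:
                embeddings[vocab[word]] = np.asarray(values, dtype=np.float32)
    return embeddings
```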

1. **Build** vocabularies and parameters for your dataset by running

```shell
python build_vocab.py --data_dir data/SemEval2010_task8
```

It will write vocabulary files `words.txt` and `labels.txt` containing the words and labels in the dataset. It will also save a `dataset_params.json` with some extra information.

2. **Your experiment.** We created a `base_model` directory for you under the `experiments` directory. It contains a file `params.json` which sets the hyperparameters for the experiment. It looks like:

```json
{
"learning_rate": 1e-3,
"batch_size": 50,
"num_epochs": 100
}
```

For every new experiment, you will need to create a new directory under `experiments` with a `params.json` file.
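At runtime these hyperparameters are presumably read back from `params.json`; a minimal sketch of that pattern is shown below. The `Params` helper is a common convention in this style of project and is assumed here, not necessarily the repository's own utility:

```python
import json


class Params:
    """Thin wrapper exposing the keys of params.json as attributes."""

    def __init__(self, json_path):
        with open(json_path) as f:
            self.__dict__.update(json.load(f))


params = Params('experiments/base_model/params.json')
print(params.learning_rate, params.batch_size, params.num_epochs)
```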

3. **Train** your experiment. Simply run

```shell
python train.py --data_dir data/SemEval2010_task8 --model_dir experiments/base_model
```

It will instantiate a model and train it on the training set following the hyperparameters specified in `params.json`. It will also evaluate some metrics on the development set.
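Schematically, training reduces to the usual PyTorch loop driven by those hyperparameters. The fragment below is a self-contained, hypothetical outline on random tensors, reusing the `CNNRelationClassifier` sketch from above; the repository's `train.py` will differ in structure and naming.

```python
import torch
import torch.nn as nn

# Hypothetical outline of a few optimization steps on random tensors,
# reusing the CNNRelationClassifier sketch defined earlier in this README.
model = CNNRelationClassifier(vocab_size=1000, num_labels=19)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # params.learning_rate in a real run
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (50, 100))   # a fake batch: batch_size=50, seq_len=100
pos1 = torch.randint(0, 200, (50, 100))      # relative positions to e1 (already shifted)
pos2 = torch.randint(0, 200, (50, 100))      # relative positions to e2
labels = torch.randint(0, 19, (50,))

for epoch in range(2):                       # params.num_epochs in a real run
    optimizer.zero_grad()
    loss = criterion(model(tokens, pos1, pos2), labels)
    loss.backward()
    optimizer.step()
    print("epoch {}: loss {:.4f}".format(epoch, loss.item()))
```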

4. **Evaluation on the test set** Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set. Run

```shell
python evaluate.py --data_dir data/SemEval2010_task8 --model_dir experiments/base_model
```

## Results

| Precision | Recall | F1 |
| :-------: | :----: | :---: |
| 77.74 | 84.79 | 81.11 |

## References

- **Relation Classification via Convolutional Deep Neural Network** (COLING 2014), D Zeng et al. [[paper]](http://www.aclweb.org/anthology/C14-1220)
- **Relation Extraction: Perspective from Convolutional Neural Networks** (NAACL 2015), TH Nguyen et al. [[paper]](http://www.cs.nyu.edu/~thien/pubs/vector15.pdf)
72 changes: 72 additions & 0 deletions build_semeval_dataset.py
"""Read and save the semeval dataset for our model"""

import os
import re

# Regexes: strip the <e1>/<e2> tags and possessive 's, capture the two tagged
# nominals, and match a leading or trailing punctuation symbol on a token.
pattern_repl = re.compile('(<e1>)|(</e1>)|(<e2>)|(</e2>)|(\'s)')
pattern_e1 = re.compile('<e1>(.*)</e1>')
pattern_e2 = re.compile('<e2>(.*)</e2>')
pattern_symbol = re.compile('^[!"#$%&\\\'()*+,-./:;<=>?@[\\]^_`{|}~]|[!"#$%&\\\'()*+,-./:;<=>?@[\\]^_`{|}~]$')


def load_dataset(path_dataset):
    """Load dataset into memory from text file"""
    dataset = []
    with open(path_dataset) as f:
        piece = list()  # a piece of data
        for line in f:
            line = line.strip()
            if line:
                piece.append(line)
            elif piece:
                sentence = piece[0].split('\t')[1].strip('"')
                e1 = delete_symbol(pattern_e1.findall(sentence)[0])
                e2 = delete_symbol(pattern_e2.findall(sentence)[0])
                new_sentence = list()
                for word in pattern_repl.sub('', sentence).split(' '):
                    new_word = delete_symbol(word)
                    if new_word:
                        new_sentence.append(new_word)

                relation = piece[1]
                dataset.append(((e1, e2, ' '.join(new_sentence)), relation))
                piece = list()
    return dataset


def delete_symbol(text):
    """Strip a leading or trailing punctuation symbol from a token."""
    if pattern_symbol.search(text):
        return pattern_symbol.sub('', text)
    return text


def save_dataset(dataset, save_dir):
    """Write `sentences.txt` and `labels.txt` files in save_dir from dataset"""
    # Create directory if it doesn't exist
    print("Saving in {}...".format(save_dir))
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Export the dataset
    with open(os.path.join(save_dir, 'sentences.txt'), 'w') as file_sentences, \
         open(os.path.join(save_dir, 'labels.txt'), 'w') as file_labels:
        for words, labels in dataset:
            file_sentences.write('{}\n'.format('\t'.join(words)))
            file_labels.write('{}\n'.format(labels))
    print("- done.")


if __name__ == '__main__':
    path_train = 'data/SemEval2010_task8/TRAIN_FILE.TXT'
    path_test = 'data/SemEval2010_task8/TEST_FILE.TXT'
    msg = "{} or {} file not found. Make sure you have downloaded the right dataset".format(path_train, path_test)
    assert os.path.isfile(path_train) and os.path.isfile(path_test), msg

    # load the dataset into memory
    print("Loading SemEval2010_task8 dataset into memory...")
    train_dataset = load_dataset(path_train)
    test_dataset = load_dataset(path_test)
    print("- done.")

    # save the dataset to text file
    save_dataset(train_dataset, 'data/SemEval2010_task8/train')
    save_dataset(test_dataset, 'data/SemEval2010_task8/test')
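With this script, each line of the generated `sentences.txt` carries the two nominals and the cleaned sentence separated by tabs, and the matching line of `labels.txt` carries the relation. For the bowl example from the README, the output would look roughly like:

```
# sentences.txt (tab-separated: e1, e2, cleaned sentence)
pears	bowl	There were apples pears and oranges in the bowl

# labels.txt (one relation per line, aligned with sentences.txt)
Content-Container(e1,e2)
```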

100 changes: 100 additions & 0 deletions build_vocab.py
"""Build vocabularies of words and labels from datasets"""

import argparse
from collections import Counter
import json
import os


parser = argparse.ArgumentParser()
parser.add_argument('--min_count_word', default=1, help="Minimum count for words in the dataset", type=int)
parser.add_argument('--min_count_tag', default=1, help="Minimum count for labels in the dataset", type=int)
parser.add_argument('--data_dir', default='data/SemEval2010_task8', help="Directory containing the dataset")


def save_to_txt(vocab, txt_path):
    """Writes one token per line, 0-based line id corresponds to the id of the token.
    Args:
        vocab: (iterable object) yields token
        txt_path: (string) path to vocab file
    """
    with open(txt_path, 'w') as f:
        for token in vocab:
            f.write(token + '\n')


def save_dict_to_json(d, json_path):
    """Saves dict to json file
    Args:
        d: (dict)
        json_path: (string) path to json file
    """
    with open(json_path, 'w') as f:
        d = {k: v for k, v in d.items()}
        json.dump(d, f, indent=4)


def update_vocab(txt_path, vocab):
    """Update word vocabulary from dataset and return the number of lines read"""
    with open(txt_path) as f:
        for i, line in enumerate(f):
            line = line.strip()
            if line.endswith('...'):
                line = line.rstrip('...')
            word_seq = line.split('\t')[-1].split(' ')
            vocab.update(word_seq)
    return i + 1


def update_labels(txt_path, labels):
    """Update label vocabulary from dataset and return the number of lines read"""
    with open(txt_path) as f:
        for i, line in enumerate(f):
            line = line.strip()  # one label per line
            labels.update([line])
    return i + 1


if __name__ == '__main__':
    args = parser.parse_args()

    # Build word vocab with train and test datasets
    print("Building word vocabulary...")
    words = Counter()
    size_train_sentences = update_vocab(os.path.join(args.data_dir, 'train/sentences.txt'), words)
    size_test_sentences = update_vocab(os.path.join(args.data_dir, 'test/sentences.txt'), words)
    print("- done.")

    # Build label vocab with train and test datasets
    print("Building label vocabulary...")
    labels = Counter()
    size_train_tags = update_labels(os.path.join(args.data_dir, 'train/labels.txt'), labels)
    size_test_tags = update_labels(os.path.join(args.data_dir, 'test/labels.txt'), labels)
    print("- done.")

    # Assert same number of examples in datasets
    assert size_train_sentences == size_train_tags
    assert size_test_sentences == size_test_tags

    # Only keep most frequent tokens
    words = sorted([tok for tok, count in words.items() if count >= args.min_count_word])
    labels = sorted([tok for tok, count in labels.items() if count >= args.min_count_tag])

    # Save vocabularies to text file
    print("Saving vocabularies to file...")
    save_to_txt(words, os.path.join(args.data_dir, 'words.txt'))
    save_to_txt(labels, os.path.join(args.data_dir, 'labels.txt'))
    print("- done.")

    # Save datasets properties in json file
    sizes = {
        'train_size': size_train_sentences,
        'test_size': size_test_sentences,
        'vocab_size': len(words),
        'num_tags': len(labels)
    }
    save_dict_to_json(sizes, os.path.join(args.data_dir, 'dataset_params.json'))

    # Logging sizes
    to_print = "\n".join("-- {}: {}".format(k, v) for k, v in sizes.items())
    print("Characteristics of the dataset:\n{}".format(to_print))
