The Focused Open Biology Information Extraction (FOBIE) dataset aims to support Information Extraction (IE) for Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.
The FOBIE dataset has been used to explore Semi-Open Relation Extraction (SORE). The SORE code and usage instructions can be found in the Readme.md inside the SORE folder, or in the ReadTheDocs documentation.
The train/dev/test data files are provided in two formats. The first is a verbose JSON format inspired by the SemEval-2018 Task 7 dataset:
    {"[document_ID]":
      {"[relation_ID_within_document]":
        {"annotations":
          {"modifiers":
            {"[within_sentence_modifier_ID]":
              {"Arg0": {"span_start": "[token_index]",
                        "span_end": "[token_index]",
                        "span_id": "[brat_ID]",
                        "text": "[string]"},
               "Arg1": {"span_start": "[token_index]",
                        "span_end": "[token_index]",
                        "span_id": "[brat_ID]",
                        "text": "[string]"}
              }
            },
           "tradeoffs":
            {"[within_sentence_tradeoff_ID]":
              {"Arg0": {"span_start": "[token_index]",
                        "span_end": "[token_index]",
                        "span_id": "[brat_ID]",
                        "text": "[string]"},
               "Arg1": {"span_start": "[token_index]",
                        "span_end": "[token_index]",
                        "span_id": "[brat_ID]",
                        "text": "[string]"},
               "TO_indicator": {"span_start": "[token_index]",
                                "span_end": "[token_index]",
                                "span_id": "[brat_ID]",
                                "text": "[string]"},
               "labels": {"Confidence": "High"}
              }
            }
          },
         "sentence": "[string]"
        }
      },
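The nesting above (document, then relation, then annotations) can be walked with Python's standard library. The following is a minimal sketch, not code from this repository: the sample record, its index values and the helper name `iter_tradeoffs` are invented for illustration, and the sketch assumes every trade-off entry carries a `TO_indicator` key and that argument keys start with `Arg`.

```python
import json

# Invented sample that follows the schema above; it is not taken from FOBIE.
# The span index values are illustrative only and are not used by the code.
SAMPLE = """
{"doc_1": {"0": {"annotations": {"modifiers": {}, "tradeoffs": {"0": {
  "Arg0": {"span_start": "3", "span_end": "5", "span_id": "T1", "text": "growth rate"},
  "Arg1": {"span_start": "6", "span_end": "8", "span_id": "T2", "text": "stress tolerance"},
  "TO_indicator": {"span_start": "1", "span_end": "2", "span_id": "T3", "text": "trade-off"},
  "labels": {"Confidence": "High"}}}},
  "sentence": "A trade-off between growth rate and stress tolerance was observed ."}}}
"""


def iter_tradeoffs(data):
    """Yield (document ID, trigger text, list of argument texts) per trade-off."""
    for doc_id, relations in data.items():
        for relation in relations.values():
            for tradeoff in relation["annotations"]["tradeoffs"].values():
                # Assumption: argument keys are named Arg0, Arg1, ...
                arg_keys = [key for key in sorted(tradeoff) if key.startswith("Arg")]
                args = [tradeoff[key]["text"] for key in arg_keys]
                yield doc_id, tradeoff["TO_indicator"]["text"], args


data = json.loads(SAMPLE)
for doc_id, trigger, args in iter_tradeoffs(data):
    print(doc_id, trigger, args)
```

To use this on a real data file, replace `json.loads(SAMPLE)` with `json.load` on an opened file handle.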
The second is the SciERC dataset format, which is used to train the SciIE system:
    {"clusters": [],
     "sentences": [["List", "of", "some", "tokens", "."]],
     "ner": [[[4, 4, "Generic"]]],
     "relations": [[[4, 4, 6, 17, "Tradeoff"]]],
     "doc_key": "XXX"}
We also provide a script to convert data from the verbose format to SciIE format, as well as a script to convert BRAT annotations to the verbose format.
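Records in the SciIE-style format can be turned back into readable triples by resolving the span indices against the tokens. The sketch below is illustrative only: the record and the helper names `span_text` and `readable_relations` are invented, and it assumes the SciERC convention that span indices are inclusive and count tokens across all sentences of a document.

```python
import json

# Invented record in the SciIE-style format; labels reuse those shown in the example above.
LINE = json.dumps({
    "clusters": [],
    "sentences": [["Growth", "rate", "trades", "off", "against", "stress", "tolerance", "."]],
    "ner": [[[0, 1, "Generic"], [2, 3, "Generic"], [5, 6, "Generic"]]],
    "relations": [[[2, 3, 0, 1, "Tradeoff"], [2, 3, 5, 6, "Tradeoff"]]],
    "doc_key": "XXX",
})


def span_text(tokens, start, end):
    """Join tokens start..end, treating the end index as inclusive (assumed)."""
    return " ".join(tokens[start:end + 1])


def readable_relations(record):
    """Return (first span text, label, second span text) for every relation."""
    # Flatten the sentences so that document-level indices can be used directly.
    tokens = [tok for sent in record["sentences"] for tok in sent]
    triples = []
    for sentence_relations in record["relations"]:
        for s1, e1, s2, e2, label in sentence_relations:
            triples.append((span_text(tokens, s1, e1), label, span_text(tokens, s2, e2)))
    return triples


record = json.loads(LINE)
for triple in readable_relations(record):
    print(triple)
```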
Also see dataset_statistics.py under the scripts folder.
| | Train | Dev | Test | Total |
|---|---|---|---|---|
| # Unique documents | 1010 | 138 | 144 | 1292 |
| # Sentences | 1248 | 150 | 150 | 1548 |
| Avg. sent. length | 37.42 | 38.91 | 40.02 | 37.81 |
| % of sents ≥ 25 tokens | 82.21% | 85.33% | 83.33% | 82.62% |
| Relations: | | | | |
| - Trade-Off | 639 | 54 | 72 | 765 |
| - Not-a-Trade-Off | 2004 | 258 | 240 | 2502 |
| - Arg-Modifier | 1247 | 142 | 132 | 1521 |
| Triggers | 1292 | 155 | 153 | 1600 |
| Keyphrases | 3436 | 401 | 398 | 4235 |
| Keyphrases w/ multiple relations | 1600 | 188 | 163 | 1951 |
| Spans | 4728 | 556 | 551 | 5835 |
| Max relations/sent | 9 | 8 | 8 | |
| Max spans/sent | 9 | 8 | 8 | |
| Max triggers/sent | 2 | 2 | 2 | |
| Max args/trigger | 5 | 4 | 4 | |
| Unique spans | 3643 | | | |
| Unique triggers | 41 | | | |
| # single-word keyphrases | 864 (20.4%) | | | |
| Avg. tokens per keyphrase | 3.46 | | | |
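The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers copied from the table. Two relationships it tests are observations about those numbers rather than statements from the dataset authors: that spans equal triggers plus keyphrases in every column, and that the 20.4% share of single-word keyphrases is taken relative to the total keyphrase count.

```python
# Counts copied from the table above, as (train, dev, test, total).
rows = {
    "documents": (1010, 138, 144, 1292),
    "sentences": (1248, 150, 150, 1548),
    "trade_off": (639, 54, 72, 765),
    "not_a_trade_off": (2004, 258, 240, 2502),
    "arg_modifier": (1247, 142, 132, 1521),
    "triggers": (1292, 155, 153, 1600),
    "keyphrases": (3436, 401, 398, 4235),
    "keyphrases_multi": (1600, 188, 163, 1951),
    "spans": (4728, 556, 551, 5835),
}

# Rows whose Train + Dev + Test does not add up to Total (expected: none).
mismatches = [
    name for name, (train, dev, test, total) in rows.items()
    if train + dev + test != total
]

# Observation: in every column, spans = triggers + keyphrases.
spans_ok = all(
    t + k == s
    for t, k, s in zip(rows["triggers"], rows["keyphrases"], rows["spans"])
)

# Share of single-word keyphrases relative to the total keyphrase count.
single_word_share = round(100 * 864 / rows["keyphrases"][3], 1)

print(mismatches, spans_ok, single_word_share)
```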
If you use the FOBIE dataset or SORE code in your research, please consider citing the following papers:
@inproceedings{Kruiper2020_SORE,
author = "Kruiper, Ruben
and Vincent, Julian F V
and Chen-Burger, Jessica
and Desmulliez, Marc P Y
and Konstas, Ioannis",
title = "In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts",
year = "2020",
url = "https://arxiv.org/pdf/2005.07751.pdf",
arxivId = "2005.07751"
}
@inproceedings{Kruiper2020_FOBIE,
author = "Kruiper, Ruben
and Vincent, Julian F V
and Chen-Burger, Jessica
and Desmulliez, Marc P Y
and Konstas, Ioannis",
title = "A Scientific Information Extraction Dataset for Nature Inspired Engineering",
booktitle = "Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)",
year = "2020",
keywords = "Biomimetics,Relation Extraction,Scientific Information Extraction,Trade-Offs",
pages = "2078--2085",
url = "http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.255.pdf",
arxivId = "2005.07753"
}
The FOBIE dataset and the SORE code in this repository are licensed under a Creative Commons Attribution 4.0 License.