GitHub - rubenkruiper/FOBIE: FOBIE dataset and code for Semi-Open Relation Extraction, applied to Biology for Computer-Aided Biomimetics.

Semi-Open Relation Extraction

The Focused Open Biology Information Extraction (FOBIE) dataset aims to support information extraction (IE) for Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

The FOBIE dataset has been used to explore Semi-Open Relation Extraction (SORE). The SORE code and instructions can be found in the Readme.md inside the SORE folder, or in the ReadTheDocs documentation.

Format

The train/test/dev data files are provided in two formats. The first is a verbose JSON format inspired by the SemEval-2018 Task 7 dataset:

{"[document_ID]":
  {"[relation_ID_within_document]":
    {"annotations":
      {"modifiers":
        {"[within_sentence_modifier_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"}
          }
        },
       "tradeoffs":
        {"[within_sentence_tradeoff_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "TO_indicator": {"span_start": "[token_index]",
                            "span_end": "[token_index]",
                            "span_id": "[brat_ID]",
                            "text": "[string]"},
           "labels": {"Confidence": "High"}
          }
        }
      },
     "sentence": "[string]"
    }
  },
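As a rough illustration of how the verbose format might be traversed, the sketch below walks the nested dictionaries and collects each trade-off indicator together with its argument texts. The sample document, its IDs and the helper name `iter_tradeoffs` are invented for this example and are not part of the repository. It also assumes that further arguments, if present, follow the same `Arg<N>` key pattern.

```python
import json

# Hypothetical miniature document in the verbose format.
# All IDs, indices and texts are invented for illustration only.
sample = json.loads("""
{
  "doc_1": {
    "rel_1": {
      "annotations": {
        "modifiers": {},
        "tradeoffs": {
          "TO1": {
            "Arg0": {"span_start": "0", "span_end": "1",
                     "span_id": "T1", "text": "growth rate"},
            "Arg1": {"span_start": "6", "span_end": "7",
                     "span_id": "T2", "text": "stress tolerance"},
            "TO_indicator": {"span_start": "4", "span_end": "4",
                             "span_id": "T3", "text": "trade-off"},
            "labels": {"Confidence": "High"}
          }
        }
      },
      "sentence": "Growth rate shows a trade-off with stress tolerance ."
    }
  }
}
""")


def iter_tradeoffs(data):
    """Yield (doc_id, sentence, indicator_text, argument_texts) per trade-off."""
    for doc_id, relations in data.items():
        for rel in relations.values():
            tradeoffs = rel["annotations"].get("tradeoffs", {})
            for tradeoff in tradeoffs.values():
                # Assumption: every argument key starts with "Arg" (Arg0, Arg1, ...).
                args = [span["text"]
                        for key, span in sorted(tradeoff.items())
                        if key.startswith("Arg")]
                yield doc_id, rel["sentence"], tradeoff["TO_indicator"]["text"], args


triples = list(iter_tradeoffs(sample))
for doc_id, sentence, indicator, args in triples:
    print(doc_id, indicator, args)
```

For a real data file, `sample` would be replaced by the result of `json.load` on that file. The `span_start` and `span_end` values are strings in the schema above, so they would need converting with `int` before being used as token indices.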

The second is the SciERC dataset format, which is used to train the SciIE system:

{   "clusters": [],
    "sentences": [["List", "of", "some", "tokens", "."]],
    "ner": [[[4, 4, "Generic"]]],
    "relations": [[[4, 4, 6, 17, "Tradeoff"]]],
    "doc_key": "XXX"}
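To make the span encoding concrete, here is a minimal sketch that maps the `ner` and `relations` entries of one record back to token strings. It assumes that span boundaries are inclusive token indices counted from the start of the document, as in the original SciERC release. The record and the helper name `decode` are made up for this example.

```python
import json

# Hypothetical record in the SciERC-style format; contents are invented.
line = json.dumps({
    "clusters": [],
    "sentences": [["Growth", "rate", "trades", "off", "with",
                   "stress", "tolerance", "."]],
    "ner": [[[0, 1, "Generic"], [5, 6, "Generic"]]],
    "relations": [[[0, 1, 5, 6, "Tradeoff"]]],
    "doc_key": "example_doc",
})


def decode(doc):
    """Return (entities, relations) with spans replaced by their token text."""
    # Assumption: indices refer to the flattened, document-level token list.
    tokens = [tok for sent in doc["sentences"] for tok in sent]

    def text(start, end):
        # Assumption: the end index is inclusive.
        return " ".join(tokens[start:end + 1])

    entities = [(text(s, e), label)
                for sent in doc["ner"]
                for s, e, label in sent]
    relations = [(text(s1, e1), label, text(s2, e2))
                 for sent in doc["relations"]
                 for s1, e1, s2, e2, label in sent]
    return entities, relations


entities, relations = decode(json.loads(line))
print(entities)
print(relations)
```

The outer lists in `ner` and `relations` run parallel to `sentences`, with one inner list per sentence, which is why both comprehensions iterate over sentences first.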

We also provide a script to convert data from the verbose format to SciIE format, as well as a script to convert BRAT annotations to the verbose format.

Statistics

Also see dataset_statistics.py under the scripts folder.

	Train	Dev	Test	Total
# Unique documents	1010	138	144	1292
# Sentences	1248	150	150	1548
Avg. sent. length	37.42	38.91	40.02	37.81
% of sents ≥ 25 tokens	82.21%	85.33%	83.33%	82.62%
Relations:
- Trade-Off	639	54	72	765
- Not-a-Trade-Off	2004	258	240	2502
- Arg-Modifier	1247	142	132	1521
Triggers	1292	155	153	1600
Keyphrases	3436	401	398	4235
Keyphrases w/ multiple relations	1600	188	163	1951
Spans	4728	556	551	5835
Max relations/sent	9	8	8
Max spans/sent	9	8	8
Max triggers/sent	2	2	2
Max args/trigger	5	4	4
Unique spans				3643
Unique triggers				41
# single-word keyphrases				864 (20.4%)
Avg. tokens per keyphrase				3.46
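A few of the sentence-level figures above could be recomputed from SciERC-style records along the following lines. This is only a sketch and not the repository's dataset_statistics.py. The function name `sentence_stats` and the toy records are assumptions made for illustration.

```python
from collections import Counter


def sentence_stats(docs, min_len=25):
    """Compute simple sentence and relation statistics for SciERC-style records."""
    lengths = [len(sent) for doc in docs for sent in doc["sentences"]]
    n = len(lengths)
    return {
        "sentences": n,
        # Mean number of tokens per sentence.
        "avg_len": sum(lengths) / n if n else 0.0,
        # Share of sentences with at least `min_len` tokens, in percent.
        "pct_long": 100.0 * sum(l >= min_len for l in lengths) / n if n else 0.0,
        # The relation label is the last element of each relation entry.
        "relations": Counter(rel[-1]
                             for doc in docs
                             for sent in doc["relations"]
                             for rel in sent),
    }


# Toy records (invented): one 30-token sentence and one 10-token sentence.
docs = [
    {"sentences": [["tok"] * 30], "relations": [[[0, 0, 2, 3, "Tradeoff"]]]},
    {"sentences": [["tok"] * 10], "relations": [[]]},
]

stats = sentence_stats(docs)
print(stats)
```

Applied to a list of records loaded from one of the data files, the same function would give per-split values that can be compared against the table.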

If you use the FOBIE dataset or SORE code in your research, please consider citing the following papers:

@inproceedings{Kruiper2020_SORE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts",
  year =        "2020",
  url =         "https://arxiv.org/pdf/2005.07751.pdf",
  arxivId =     "2005.07751"
}

@inproceedings{Kruiper2020_FOBIE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "A Scientific Information Extraction Dataset for Nature Inspired Engineering",
  booktitle =   "Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)",
  year =        "2020",
  keywords =    "Biomimetics,Relation Extraction,Scientific Information Extraction,Trade-Offs",
  pages =       "2078--2085",
  url =         "http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.255.pdf",
  arxivId =     "2005.07753"
}

The FOBIE dataset and the SORE code in this repository are licensed under a Creative Commons Attribution 4.0 License.
