This repository contains the links, data, and scripts for the following overarching research objective: extracting information about species, locations, habitats, and ecosystems from the invasion biology literature.
Step 1: Create the corpus.
Starting from the paper metadata in the Wikidata Invasion Biology collection released on Zenodo, we compiled a corpus for text data mining by retrieving each paper's abstract and, where available, its full text from the external ORKG Ask API service.
Step 2: Define the Information Extraction (IE) model.
Step 3:
The script `scripts/invasion-biology-full_text-search.py` queries the ASK ORKG API to retrieve metadata and full texts (where available) of the publications listed in the Invasion Biology WikiProject. Specifically, it issues the explore document GET request, with each publication’s DOI as the document ID, to build a dataset for text data mining.
The source for these DOIs is a dataset of 49,716 publications, compiled from the Invasion Biology WikiProject published at DOI 10.5281/zenodo.12518036.
- data/publications-with-full_text.csv – Contains records for publications where full text is available.
- data/publications-no_full_text.csv – Contains records for publications whose DOI is in the ASK datastore but whose full text is not available.
- data/publications-error_log.csv – Contains records for publications whose DOIs were not found in the ASK datastore.
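The routing of DOIs into these three files can be sketched as follows. This is a minimal illustration, not the repository's script: `partition_by_full_text` and the `fetch` callable are hypothetical names, the HTTP call to the API is deliberately left to the caller, and the `full_text` key is assumed to match the field of the same name in the output records. Sending request failures to the error log is also an assumption.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical type: takes a DOI and returns the document as a dict,
# or None when the DOI is not present in the datastore.
Fetcher = Callable[[str], Optional[dict]]


def partition_by_full_text(dois: Iterable[str], fetch: Fetcher) -> Dict[str, List[dict]]:
    """Sort DOIs into the three output categories described above."""
    buckets: Dict[str, List[dict]] = {
        "with_full_text": [],
        "no_full_text": [],
        "error_log": [],
    }
    for doi in dois:
        try:
            doc = fetch(doi)
        except Exception as exc:
            # Assumption: request failures are logged alongside missing DOIs.
            buckets["error_log"].append({"doi": doi, "error": str(exc)})
            continue
        if doc is None:
            buckets["error_log"].append({"doi": doi, "error": "not found"})
        elif (doc.get("full_text") or "").strip():
            buckets["with_full_text"].append({"doi": doi, **doc})
        else:
            buckets["no_full_text"].append({"doi": doi, **doc})
    return buckets
```

Passing the fetcher in as an argument keeps the routing logic testable without network access; each bucket can then be written to its CSV file.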
The script `scripts/hypothesis-search.py` uses the ASK ORKG API’s semantic search GET request to retrieve the top 50 publications relevant to each of a set of expert-curated hypotheses. These hypotheses are derived from the original dataset published at DOI 10.5281/zenodo.12518036.
Each output record includes the following fields: `hypothesis`, `publication_id`, `title`, `doi`, `authors`, `year`, `abstract`, `full_text`, `subjects`, `topics`, `journals`, and `publisher`.
- data/hypotheses-based-publications.csv – Publications relevant to hypotheses, without date restrictions.
- data/hypotheses-based-publications-after-2010.csv – Publications relevant to hypotheses, filtered to include only those published after 2010.
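A minimal sketch of how the date-filtered file could be derived from the unfiltered records is shown below. The helper names `published_after` and `to_csv` are hypothetical, and two points are assumptions rather than documented behavior: "after 2010" is read as strictly later than 2010, and records with a missing or non-numeric `year` are dropped.

```python
import csv
import io
from typing import Iterable, List

# Field names as listed for each output record.
FIELDS = [
    "hypothesis", "publication_id", "title", "doi", "authors", "year",
    "abstract", "full_text", "subjects", "topics", "journals", "publisher",
]


def published_after(records: Iterable[dict], cutoff: int = 2010) -> List[dict]:
    """Keep records whose year is strictly greater than the cutoff."""
    kept = []
    for rec in records:
        try:
            year = int(rec.get("year"))
        except (TypeError, ValueError):
            # Assumption: records without a usable year are excluded.
            continue
        if year > cutoff:
            kept.append(rec)
    return kept


def to_csv(records: Iterable[dict], fields: List[str] = FIELDS) -> str:
    """Serialize records to CSV text, leaving missing fields empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()
```

Calling `to_csv(published_after(records))` produces the text for the filtered file, while `to_csv(records)` produces the unfiltered one.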
| Hypothesis | Publications (No Date Filter) | Publications (Post-2010) |
|---|---|---|
| Antarctic climate-diversity-invasion hypothesis | 2 | 3 |
| Anthropogenically induced adaptation to invade | 4 | 2 |
| Enhanced Mutualism Hypothesis | 4 | 4 |
| Intermediate Disturbance Hypothesis | 1 | 1 |
| Biotic Resistance Hypothesis | 0 | 3 |
| Disturbance Hypothesis | 2 | 2 |
| Enemy Release Hypothesis | 2 | 0 |
| Habitat Amount Hypothesis | 1 | 2 |
| Invasional Meltdown Hypothesis | 3 | 5 |
| Island Susceptibility Hypothesis | 2 | 2 |
| Limiting Similarity Hypothesis | 1 | 0 |
| Novel Weapons Hypothesis | 8 | 6 |
| Phenotypic Plasticity Hypothesis | 3 | 3 |
| Propagule Pressure Hypothesis | 2 | 1 |
| Tens Rule | 2 | 1 |
- meta-analysis/total_publisher_counts.csv – Summarizes publisher counts for publications without date restrictions.
- meta-analysis/total_publisher_counts_after_2010.csv – Summarizes publisher counts for publications published after 2010.
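One plausible way to produce such a summary is to tally the `publisher` field across the output records. The sketch below assumes that is what "publisher counts" means; `publisher_counts` is a hypothetical helper, and grouping empty values under an `(unknown)` label is an illustrative choice, not something the repository documents.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def publisher_counts(records: Iterable[dict]) -> List[Tuple[str, int]]:
    """Count publications per publisher, most frequent first."""
    counts = Counter(
        # Assumption: blank or missing publishers share one placeholder label.
        (rec.get("publisher") or "").strip() or "(unknown)"
        for rec in records
    )
    return counts.most_common()
```

Running it once on all records and once on the post-2010 subset would yield the two summaries.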
This repository extends the original Invasion Biology Corpus, compiled from curated data within the Invasion Biology WikiProject, for text data mining applications.