We present an approach for constructing an RDF knowledge graph for datasets. To build the knowledge graph, we use datasets registered in OpenAIRE and Wikidata. We identify all publications out of 146 million scientific publications which contain mentions of datasets, and establish links between the dataset and publication representations in the Microsoft Academic Knowledge Graph. As the author names of datasets can be ambiguous, we develop and evaluate a method for author name disambiguation and enrich the knowledge graph with links to ORCID. Overall, our knowledge graph contains 2,208 datasets with associated properties, as well as 813,551 links to scientific publications. It can be used for a variety of scenarios, facilitating advanced dataset search systems and new ways of measuring and awarding the provisioning of datasets. The constructed data set knowledge graph (DSKG) is provided with a SPARQL endpoint and resolvable URIs at http://dskg.org and is also available on Zenodo.
[Figure: Schema of the DSKG]

The repository provides all the scripts needed to create the knowledge graph semi-automatically. The following manual explains how to create the knowledge graph.
We use the following databases with metadata about datasets for the creation of the DSKG:
- OpenAIRE-Dataset: We consider a subset of the OpenAIRE Research Graph dump which contains metadata about datasets. The dump we use was created with this code: https://github.com/michaelfaerber/OpenAIRE.
- Wikidata-Dataset: We use instances of the Wikidata classes which represent datasets. The instances of the relevant classes and their properties can be retrieved with semantic queries via the publicly available Wikidata SPARQL endpoint.
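The Wikidata retrieval could look roughly like the following sketch. It is not the code used in this repository: the class ID `Q1172284` in the usage note, the property `P856` (official website) and the helper names are assumptions chosen for illustration, and the classes actually used for the DSKG may differ.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_dataset_query(class_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query for instances of one Wikidata class (hypothetical helper)."""
    return f"""
SELECT ?item ?itemLabel ?officialWebsite WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  OPTIONAL {{ ?item wdt:P856 ?officialWebsite . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def parse_bindings(response: dict) -> list[dict]:
    """Flatten the SPARQL JSON results format into plain dictionaries."""
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def fetch_datasets(class_qid: str, limit: int = 10) -> list[dict]:
    """Send the query to the public endpoint (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_dataset_query(class_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_ENDPOINT}?{params}",
        headers={"User-Agent": "dskg-readme-example/0.1"},  # placeholder agent string
    )
    with urllib.request.urlopen(request) as reply:
        return parse_bindings(json.load(reply))
```

Calling `fetch_datasets("Q1172284")` would then return one dictionary per matching item, assuming that class ID is the one of interest.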
We use a string-based algorithm to detect mentions of datasets in papers. For the matching, we use the files of the MAG dump that contain the paper abstracts and citation contexts. For datasets from OpenAIRE, we use the following metadata fields to recognize datasets in the files: `title`, `originalId` and `doi`. For datasets from Wikidata, we use: `itemLabel`, `altLabel`, `officialWebsite`, `workURL` and `url`.
- The first step is to filter out the most frequently used English words from the matching. The following script computes the intersection that is excluded from the matching: `match_text_corpus.py`.
- Then run the script for the matching. The MAG dumps are used as input, and the output is a set of text files with the matches found: `string_based_matching_MAKG.py`.
- After that, the results are filtered using the created filter list to reduce false matches: `filterwords_matching.py`.
- The following script inserts the found matches into the initial datasets (CSV files) of the OpenAIRE and Wikidata datasets: `MAKG_links_in_csv.py`.

In the following, we only use the metadata records for which at least one link could be found.
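The matching steps above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names are hypothetical, the filtering is folded into the matching step for brevity, and whole-phrase, case-insensitive matching is an assumption about how the string comparison might work.

```python
import re


def build_filter(dataset_names, frequent_words):
    """Collect dataset names that coincide with frequent English words.

    Such names are too ambiguous to be matched reliably by string comparison.
    """
    frequent = {word.lower() for word in frequent_words}
    return {name for name in dataset_names if name.lower() in frequent}


def find_mentions(text, dataset_names, filtered_names=frozenset()):
    """Return the dataset names that occur in the text as whole words or phrases."""
    found = []
    for name in dataset_names:
        if name in filtered_names:
            continue  # skip names on the filter list
        # Guards on both sides so that "MNIST" does not match inside "EMNIST".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found
```

For example, with the names `["ImageNet", "Adult"]` and a frequent-word list containing `"adult"`, only `ImageNet` would be reported for the sentence "We evaluate on ImageNet with adult participants."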
We implemented the transformation of the original metadata using SPARQL CONSTRUCT and SPARQL INSERT queries in Ontotext's GraphDB graph database.
- Clean up the OpenAIRE dataset (CSV file) entries and adapt the metadata entries of the size property to DCAT: `preprocessing_OpenAIRE.py`.
- Perform the classification of the metadata entries for OpenAIRE and Wikidata according to DCAT: `classification_resources.py`.
- In GraphDB: Create a beta version of the DSKG in which the properties are mapped to DCAT but no URIs are assigned to the resources yet. The dskg-beta-version is created with SPARQL CONSTRUCT and INSERT queries (`SPARQL_CONSTRUCT_openAIRE_beta_version.txt` and `SPARQL_CONSTRUCT_wikidata_beta_version.txt`) over the OpenAIRE and Wikidata datasets in tabular form (CSV files). For the further steps, the beta version of the DSKG is exported as a table using a SPARQL SELECT query in GraphDB.
- Use the file `PaperFieldsOfStudy.txt` from the MAG dump, the dskg-beta-version and the Jupyter Notebook `fields_of_application.ipynb` to determine the fields of application of the datasets and add them to the dskg-beta-version.
- Perform the author disambiguation explained in the paragraph below.
- Assign unique URIs to the entities in the dskg-beta-version (uses the results of the author disambiguation): `assign_uris_for_entities.py`.
- Load the enriched information from the dskg-beta-version into the classified OpenAIRE and Wikidata datasets for the final construction of the knowledge graph. Create CSV files from the dskg-beta-version for each class of entities in the metadata: `final_csv_files_transformation_dcat.ipynb`.
- Load the generated CSV files into a GraphDB repository and transform the table data into RDF using the SPARQL CONSTRUCT and SPARQL INSERT queries to construct the final DSKG.
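As an illustration of the fields-of-application step, the following sketch counts the fields of study of the papers linked to each dataset and keeps the most frequent ones. This is an assumption about the aggregation, not a description of what `fields_of_application.ipynb` does. It also assumes that `PaperFieldsOfStudy.txt` is tab-separated with the paper ID in the first column and the field-of-study ID in the second; the function name is hypothetical.

```python
import csv
from collections import Counter


def fields_of_application(dataset_to_papers, paper_field_lines, top_k=3):
    """Return the top_k most frequent field-of-study IDs per dataset.

    dataset_to_papers: dict mapping a dataset ID to the IDs of its linked papers.
    paper_field_lines: iterable of tab-separated lines (for example an open file),
        assumed layout: paper ID, field-of-study ID, further columns ignored.
    """
    paper_to_fields = {}
    for row in csv.reader(paper_field_lines, delimiter="\t"):
        paper_id, field_id = row[0], row[1]
        paper_to_fields.setdefault(paper_id, []).append(field_id)

    result = {}
    for dataset, papers in dataset_to_papers.items():
        counts = Counter(
            field for paper in papers for field in paper_to_fields.get(paper, [])
        )
        result[dataset] = [field for field, _ in counts.most_common(top_k)]
    return result
```

Since the function accepts any iterable of lines, it can be fed an open file handle of the dump or a small list of strings for testing.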
Note on using SPARQL CONSTRUCT and INSERT queries in GraphDB:
The SPARQL INSERT queries are identical to the CONSTRUCT queries, except for the replacement of the keyword (INSERT instead of CONSTRUCT), the removal of the LIMIT 100 restriction, and the addition of the corresponding SPARQL endpoint within the WHERE clause: `WHERE { SERVICE <ontorefine:99999999999> {...} }`. `<ontorefine:99999999999>` is an example of a SPARQL endpoint in GraphDB.
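The three differences described in the note can be applied mechanically. The following helper is a hypothetical sketch and not part of the repository. It assumes that the query contains exactly one top-level `WHERE { ... }` block, that this block ends at the last closing brace, and that the CONSTRUCT query does not already contain a `SERVICE` clause.

```python
import re


def construct_to_insert(query: str, endpoint: str) -> str:
    """Turn a CONSTRUCT query into the corresponding INSERT query."""
    # 1. Swap the keyword (first occurrence only).
    updated = re.sub(r"\bCONSTRUCT\b", "INSERT", query, count=1)
    # 2. Drop the LIMIT 100 restriction.
    updated = re.sub(r"\s*\bLIMIT\s+100\b", "", updated)
    # 3. Wrap the body of the WHERE clause in a SERVICE block for the endpoint.
    match = re.search(r"\bWHERE\s*\{", updated)
    if match is None:
        raise ValueError("query has no WHERE clause")
    body_start = match.end()
    body_end = updated.rindex("}")  # assumed to close the WHERE clause
    body = updated[body_start:body_end]
    return (
        f"{updated[:body_start]} SERVICE <{endpoint}> {{{body}}} "
        f"{updated[body_end:]}"
    )
```

For instance, `construct_to_insert(query, "ontorefine:99999999999")` produces a query that starts with `INSERT` and whose WHERE clause has the form `WHERE { SERVICE <ontorefine:99999999999> {...} }`.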
- Perform a SPARQL query over the dskg-beta-version to get a table with the relevant information about the datasets for the LDA model.
- Calculate the LDA vectors for the datasets and load them into the dskg-beta-version for the author disambiguation with the Jupyter Notebook `LDA-Modell.ipynb`.
- Perform the author disambiguation with the Jupyter Notebook `author_disambiguation.ipynb`. Use the dskg-beta-version from the LDA model as input. The code first creates a text file that contains all the information needed for the author disambiguation, which is then used to perform the disambiguation.
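To give an idea of how LDA vectors can feed into author disambiguation, here is a deliberately simple sketch. It does not reproduce `author_disambiguation.ipynb`, whose features and clustering procedure are not described in this README. It assumes that two mentions refer to the same author if the names are identical (ignoring case) and the topic vectors of their datasets have a cosine similarity above a threshold. The function names and the threshold of 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(mentions, threshold=0.8):
    """Greedily group author mentions into author clusters.

    mentions: list of (author_name, dataset_id, topic_vector) tuples.
    Returns a list of (normalized_name, [dataset_ids]) pairs, one per cluster.
    """
    clusters = []
    for name, dataset_id, vector in mentions:
        key = name.lower()
        best, best_score = None, threshold
        for cluster in clusters:
            if cluster["name"] != key:
                continue  # only mentions with the same name can be merged
            score = max(cosine(vector, other) for other in cluster["vectors"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            best = {"name": key, "vectors": [], "datasets": []}
            clusters.append(best)
        best["vectors"].append(vector)
        best["datasets"].append(dataset_id)
    return [(c["name"], c["datasets"]) for c in clusters]
```

With this rule, two "J. Smith" mentions on topically similar datasets end up in one cluster, while a "J. Smith" mention on a dataset with a very different topic distribution forms a cluster of its own.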
- Perform a SPARQL query over the DSKG to get a table (CSV file) with the author profiles from the knowledge graph.
- Query the titles of the linked papers using the MAKG SPARQL endpoint: `02MAKG_paper_titels.py`.
- Query the author names via the ORCID API: `03ORCID_API.py`.
- Perform the linking to ORCID by running the script that compares the author profiles: `04linking_authors_to_orcid.py`.
- Insert the found ORCID iDs of the authors into the CSV file which contains the metadata of the authors: `05add_ORCID_IDs_to_csv.py`.
- Add the ORCID iDs to the knowledge graph in GraphDB using SPARQL CONSTRUCT and SPARQL INSERT.
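The profile comparison in the linking step could, for example, be based on shared paper titles. The sketch below is only an assumption about one possible comparison and is not the logic of `04linking_authors_to_orcid.py`: it picks the ORCID candidate whose list of works shares the most normalized titles with the papers linked to the DSKG author. The function names and the ORCID iDs in the usage note are made up.

```python
import re


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def link_to_orcid(author_titles, candidates, min_shared=1):
    """Return the ORCID iD of the best-matching candidate, or None.

    author_titles: titles of the papers linked to the author in the DSKG.
    candidates: dict mapping a candidate ORCID iD to the titles of its works.
    min_shared: minimum number of shared titles required for a link.
    """
    own = {normalize(title) for title in author_titles}
    best_id, best_shared = None, 0
    for orcid_id, titles in candidates.items():
        shared = len(own & {normalize(title) for title in titles})
        if shared > best_shared:
            best_id, best_shared = orcid_id, shared
    return best_id if best_shared >= min_shared else None
```

Given a candidate dictionary such as `{"0000-0000-0000-0001": [...], "0000-0000-0000-0002": [...]}`, the function returns the iD with the largest title overlap, or `None` if no candidate shares at least `min_shared` titles.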
See http://dskg.org/.
The system has been designed and implemented by Michael Färber and David Lamprecht. Feel free to reach out to us:
Michael Färber, michael.faerber@kit.edu
Please cite our paper as follows:
@article{Faerber2021DSKG,
author = {Michael F{\"{a}}rber and
David Lamprecht},
title = "{The data set knowledge graph: Creating a linked open data source for data sets}",
journal = {Quantitative Science Studies},
publisher = {MIT Press},
volume = {2},
number = {4},
pages = {1324-1355},
year = {2021},
issn = {2641-3337},
doi = {10.1162/qss_a_00161},
url = {https://doi.org/10.1162/qss\_a\_00161}
}