Pipeline to extract software implementations from research publications

ptorija/RSEF

 
 

Research Software Extraction Framework (RSEF)

Introduction

This tool verifies the link between a scientific paper and a software repository. It accomplishes this by locating the URL of the software repository within the scientific paper. It then extracts the repository's metadata to find any URLs associated with scientific papers and checks if they lead back to the original paper. If a bidirectional link is established, it marks it as "bidirectional".

There is also a "unidirectional" metric, which finds a repository URL and checks whether the paper is named within the repository.
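The bidirectional check described above can be sketched as follows. This is a minimal illustration and not RSEF's actual code: the function names, the argument layout, and the DOI-based comparison are assumptions made for the example, and the URL lists are taken as already extracted from the paper and from the repository metadata.

```python
import re


def normalize_doi(value: str) -> str:
    """Lowercase a DOI and strip a leading doi.org URL prefix, if present."""
    value = value.strip().lower()
    return re.sub(r"^(https?://)?(dx\.)?doi\.org/", "", value)


def is_bidirectional(paper_doi, repo_url, urls_in_paper, urls_in_repo_metadata):
    """Return True when the paper cites the repository and the repository cites the paper.

    Hypothetical helper: both URL lists are assumed to be extracted beforehand.
    """
    # Direction 1: the repository URL appears in the paper.
    target = repo_url.rstrip("/").lower()
    paper_to_repo = any(u.rstrip("/").lower() == target for u in urls_in_paper)

    # Direction 2: a URL in the repository metadata resolves to the paper's DOI.
    doi = normalize_doi(paper_doi)
    repo_to_paper = any(normalize_doi(u) == doi for u in urls_in_repo_metadata)

    return paper_to_repo and repo_to_paper
```

A call such as `is_bidirectional(doi, repo_url, paper_urls, repo_urls)` returns `False` as soon as either direction is missing, which matches the idea that a link is only marked "bidirectional" when both sides point at each other.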

Dependencies

Installation

Install the required dependencies by running:


pip install -e .

Highly recommended steps:


somef configure

You will be asked to provide:

Usage


Usage: rsef [OPTIONS] COMMAND [ARGS]...

RRRRRRRRR     SSSSSSSSS    EEEEEEEEE  FFFFFFFFF  
RRR    RRR   SSS     SSS   EEE        FFF  
RRR    RRR   SSSS          EEE        FFF
RRRRRRRRR     SSSSSSSSS    EEEEEEE    FFFFFFF  
RRR    RRR          SSSS   EEE        FFF  
RRR     RRR   SSS    SSS   EEE        FFF  
RRR      RRR   SSSSSSSS    EEEEEEEEE  FFF  
  
Research Software Extraction Framework (RSEF)\n
Find and assess Research Software within Research papers.

Usage:
1. (assess) Assess doi for unidirectionality or bidirectionality
2. (download) Download PDF (paper) from a doi or list

Options:
--version Show the version and exit.
-h, --help  Show this message and exit.

Commands:
	assess
	download
 

Download

The download command downloads a paper's PDF and its metadata given an identifier (ArXiv or DOI). Alongside the PDFs folder, a downloaded_metadata.json file is created containing the Title, DOI, ArXiv and filename/filepath of each downloaded paper.

rsef download -h 
Usage: rsef download [OPTIONS]

Options:

-i, --input <name> DOI or path to .txt list of DOIs  [required]

-o, --output <path>  Output Directory [default: ./]

-h, --help Show this message and exit.

Assess

The assess command determines whether the paper behind a given identifier (ArXiv or DOI) has a bidirectional link to a repository. This command accepts different inputs:

  1. A single DOI/ArXiv

  2. A list of identifiers given as a .txt

  3. A path to downloaded_metadata.json file

  4. A path to processed_metadata.json file

  5. A path to results.json file, obtained by executing doiExtractor tool (https://github.com/oeg-upm/DOI-Extractor-OEG). This results.json file has the following format:

[
    {
        "title": title of the paper,
        "doi": DOI of the paper,
        "primary_location": URL of the paper's PDF if exists in OpenAlex
    },
]
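As a rough sketch, a results.json file in this format could be checked before being passed to the assess command. The loader below is not part of RSEF; the function name is hypothetical, and representing a missing PDF location as `null` is an assumption.

```python
import json

# Keys taken from the format shown above.
REQUIRED_KEYS = ("title", "doi", "primary_location")


def load_results(text: str) -> list:
    """Parse results.json content and check that every entry has the expected keys."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("results.json must contain a JSON list")
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing keys: {missing}")
    return entries


# Illustrative content only; primary_location is assumed to be null when no PDF is known.
sample = """
[
    {
        "title": "An example paper",
        "doi": "10.1234/example",
        "primary_location": null
    }
]
"""
entries = load_results(sample)
```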

The resulting information after executing the command will be saved in output/url_search_output.json.

rsef assess -h
Usage: rsef assess [OPTIONS]

Options:

-i, --input <name> DOI, path to .txt list of DOIs or path to processed_metadata.json [required]

-o, --output <path>  Output csv file  [default: output]

-U, --unidir Unidirectional link search

-B, --bidir  Bidirectional link search

-h, --help Show this message and exit.
  • The unidirectional link search uses the RepoFromPaper submodule to search the paper for the link to its implementation repository. RepoFromPaper uses a SciBERT model to classify each sentence of the paper as a proposal sentence or not. The 5 highest-ranked sentences are then searched for a repository link. If no link is found in this initial search, a footnote/reference search is conducted. RepoFromPaper returns either a link or an empty response if no link is found.
  • The bidirectional link search looks for paper-repo links where the paper points to the repo and the repo (metadata) points back to the paper. The bidir search targets 'Git' and 'Zenodo' links. The search may find zero, one, or multiple bidirectional links.
  • If repository links are found in the paper, the links and the search methods that found them (unidir/bidir) are added to the ImplementationUrl list in the paper's PaperObj. This list is initially created with a regex link search when the paper object is created. Both the unidir and bidir search methods return the PaperObj, with the updated ImplementationUrl list if links were found.
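The rank-then-search step of the unidirectional search can be illustrated with the sketch below. It is not the RepoFromPaper implementation: a simple keyword score stands in for the SciBERT classifier, the list of repository hosts in the pattern is an assumption, and the footnote/reference fallback is omitted.

```python
import re
from typing import Callable, List, Optional

# Assumed set of repository hosts, for illustration only.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com|zenodo\.org)/[^\s)\]>,;]+",
    re.IGNORECASE,
)


def keyword_score(sentence: str) -> float:
    """Stand-in for a classifier score: counts cue phrases typical of proposal sentences."""
    cues = ("our code", "implementation", "available at", "source code", "we release")
    lowered = sentence.lower()
    return float(sum(cue in lowered for cue in cues))


def find_repo_link(
    sentences: List[str],
    score: Callable[[str], float] = keyword_score,
    top_k: int = 5,
) -> Optional[str]:
    """Search the top_k highest-scoring sentences for a repository link."""
    ranked = sorted(sentences, key=score, reverse=True)[:top_k]
    for sentence in ranked:
        match = REPO_URL.search(sentence)
        if match:
            # Drop a trailing period picked up from the end of the sentence.
            return match.group(0).rstrip(".")
    return None  # Empty response when no link is found.
```

Passing a different `score` callable (for example, one backed by a trained model) leaves the ranking and link search unchanged, which is why the scorer is kept as a parameter in this sketch.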

Structure

The src/RSEF is divided into the following directories:

  1. Download_pdf

  2. Metadata

  3. Extraction

  4. Object_creator

  5. RepoFromPaper

  6. Modelling

  7. Prediction

  8. Utils

Download_pdf

Handles all PDF downloading.

Downloaded_obj represents a downloaded paper that has not been processed yet.

Contains:

- Title 
- DOI
- ArXiv
- file_path
- file_name

These objects are normally saved to a downloaded_metadata.json file.
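As an illustration, a record with the fields listed above could be written to and read back from JSON as follows. The class name, the JSON key spelling, and the list-of-objects layout are assumptions for this sketch; the actual downloaded_metadata.json written by RSEF may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DownloadedRecord:
    """Hypothetical mirror of the fields listed above."""
    title: str
    doi: Optional[str]
    arxiv: Optional[str]
    file_path: str
    file_name: str


def dump_records(records: List[DownloadedRecord]) -> str:
    """Serialize records as a JSON list of objects."""
    return json.dumps([asdict(record) for record in records], indent=2)


def load_records(text: str) -> List[DownloadedRecord]:
    """Rebuild records from JSON produced by dump_records."""
    return [DownloadedRecord(**entry) for entry in json.loads(text)]


record = DownloadedRecord(
    title="An example paper",
    doi="10.1234/example",
    arxiv=None,
    file_path="./PDFs/example.pdf",
    file_name="example.pdf",
)
round_tripped = load_records(dump_records([record]))
```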

Metadata

Encompasses all requests to OpenAlex and other APIs for fetching a paper's metadata, as well as general requests.

MetadataObj contains the metadata from OpenAlex: DOI, arXiv and title.

Extraction

Tika scripts to open a PDF and extract its URLs are found within this module.

PaperObj is created once the downloadedObj's PDF has been processed to locate all its URLs. It contains:

  • DOI
  • arXiv
  • Abstract
  • Title
  • File_path
  • File_name
  • URLs

Finally, this module contains the functions needed to download a repository and extract its metadata with SOMEF.
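The URL-location step for a paper can be pictured with the sketch below. It assumes the PDF text has already been extracted (for example by Tika) and uses a deliberately simple pattern; the regex and function name are illustrative and are not the ones used in RSEF.

```python
import re
from typing import List

# Simple illustrative pattern: stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> List[str]:
    """Return the distinct URLs found in text, in order of first appearance."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # Trim sentence punctuation glued to the URL.
        if url not in urls:
            urls.append(url)
    return urls


paper_text = (
    "Code: https://github.com/example/repo. "
    "Data (https://zenodo.org/record/1), see https://github.com/example/repo."
)
found = extract_urls(paper_text)
```

Deduplicating while preserving order keeps the first mention of each URL, which is convenient when later steps care about where in the paper a link appeared.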

Object Creator

This is the pipeline broken down into its main parts. Please look at pipeline.py to view the execution process.

RepoFromPaper

Contains the code for the RepoFromPaper (RFP) package. RFP uses a combination of natural language processing and heuristics to identify and extract the repository links that a paper mentions in proposal sentences.

As input it receives the local file path of the paper and returns the repository URL if found.

Modelling

Contains all the assessment logic for bidirectionality and unidirectionality.

Receives a paperObj and a repository_metadata JSON.

Prediction

Used to assess the program against its corpus. The corpus can be found in corpus.csv. The F1 score obtained for the bidirectional search is stored in corpus_eval_bidir.json, and the equivalent file for the unidirectional search uses the _unidir suffix.

Tests

Tests can be found in the ./tests folder

Evaluation

This software has been evaluated with a corpus created expressly for this purpose.

This corpus, composed of 154 paper-repository pairs, includes several types of bidirectional paper-repository links (e.g., repositories including just the title, arXiv URLs or DOIs in the README file, using citation files, repositories in different code platforms, etc.)

Two annotators were selected to manually annotate the presence of bi-directionality. In case of two different annotations in the corpus, a third annotator decided the final value.

This corpus, in TSV format, can be found in the evaluation/bidirectional folder.

The RAW version of the corpus can be found here.

To run the evaluation, use the script in the evaluation_only_ids folder.

python eval_corpus_big.py

A file named output_metrics.json will be generated with the results of the evaluation. It contains metrics such as precision, recall and F1-score, as well as the list of false negatives.

The results of the evaluation using our corpus are:

  • precision: 1.0
  • recall: 0.918918918918919
  • f1-score: 0.9577464788732395
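The reported F1-score follows from the reported precision and recall as their harmonic mean. The short check below is an illustration of that relationship and is not taken from the evaluation script; the function name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for the bidirectional evaluation.
precision = 1.0
recall = 0.918918918918919

f1 = f1_score(precision, recall)
print(round(f1, 4))  # Agrees with the reported f1-score to four decimal places.
```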

Cite RSEF

Please refer to our Mining Software Repositories 2024 paper:

@InProceedings{garijo_2024_bidirectional,
    author="Garijo, Daniel and Arroyo, Miguel and Gonzalez, Esteban and Treude, Christoph and Tarocco, Nicola",
    title="Bidirectional Paper-Repository Tracing in Software Engineering",
    booktitle="21st International Conference on Mining Software Repositories",
    year="2024",
    publisher="ACM",
    address="Cham",
    doi="10.1145/3643991.3644876",
    url= {https://dgarijo.com/papers/msr_2024.pdf}
}

License

This project is licensed under the MIT License.
