
Pubmed Pipeline Python Library

Overview

This library allows for the easy creation of a machine learning pipeline that uses PubMed as its data source. Two kinds of pipelines can be made:

Setup Pipeline: Downloads all the papers from PubMed matching a specific search query, applies a machine learning classifier to them, and saves the output in parquet format.

Update Pipeline: Downloads all new and updated papers since the setup pipeline, or the last update pipeline, was run. These papers are added to the main dataframe created by the setup pipeline and are also stored in a separate dataframe written in parquet format.

Requirements

Python 3+

pip

git

parallel

xmlstarlet

wget

curl

Installation

Make sure you have Python and pip installed.

If you do not have git installed, follow these instructions to install it.

  1. Clone this repository (or alternatively download it directly from the GitHub page):
git clone https://github.com/nicford/Pubmed-Pipeline.git
  2. In your terminal, navigate into the cloned/downloaded folder. Run the following command to install the Pubmed Pipeline library:
pip install pubmed_pipeline
  3. Install the other required dependencies (a quick way to verify they are installed is sketched after this list):

    Follow these instructions to install parallel.

    Follow these instructions to install xmlstarlet.

    Follow these instructions to install wget.

    Follow these instructions to install curl.
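If you want to double-check that these command-line tools are available before running a pipeline, a small check along the following lines works. This is a minimal sketch; the executable names assumed here are the usual ones, but xmlstarlet in particular sometimes installs its binary as xml on some systems.

import shutil

# External command-line tools the pipeline's download scripts rely on
requiredTools = ["parallel", "xmlstarlet", "wget", "curl"]

# shutil.which returns None when an executable cannot be found on PATH
missingTools = [tool for tool in requiredTools if shutil.which(tool) is None]
if missingTools:
    print("Missing tools: " + ", ".join(missingTools))
else:
    print("All external dependencies found")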

Usage

Requirements

1. Spark Session

To create a pipeline object, you need to pass in a Spark session, so you must configure one beforehand. If you are unfamiliar with Spark sessions, you can get started here. Note: if you are using Databricks, a Spark session called "spark" is created automatically.
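For example, a minimal local session can be created as follows. This is a sketch assuming pyspark is installed locally; on Databricks you would simply use the provided spark session instead.

from pyspark.sql import SparkSession

# Minimal local Spark session; adjust master, appName and config for your environment
sparkSession = SparkSession.builder \
                       .master("local[*]") \
                       .appName("pubmed-pipeline") \
                       .getOrCreate()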

2. API key (optional, for XML downloads)

If you do not have your own PubMed XML data and you wish to download XML paper data from PubMed, you need a PubMed API key. This API key can be obtained by doing the following:

"Users can obtain an API key now from the Settings page of their NCBI account. To create an account, visit http://www.ncbi.nlm.nih.gov/account/."

Setup Pipeline

The setup pipeline class allows you to create a pipeline.

See below for how to use the setup pipeline.

from pubmed_pipeline import PubmedPipelineSetup
from pyspark.sql import SparkSession

XMLFilesDirectory = ""     # path to save downloaded XML content from PubMed, or path to XML data you already have
numSlices = ""             # number of partitions the data will be parallelized into
searchQueries = [""]       # list of search query strings to run against PubMed
apiKey = ""                # PubMed API key to allow an increased request rate and avoid HTTP 429 errors (see the E-utilities website for how to get a key)
lastRunDatePath = ""       # path to store a pickle object of the date when the setup is run (provide the same path to the update job)
classifierPath = ""        # path to the classifier used to classify papers
dataframeOutputPath = ""   # path to store the final dataframe in parquet format

# your Spark session configuration
sparkSession = SparkSession.builder \
                       .master("local") \
                       .appName("") \
                       .config("spark.some.config.option", "some-value") \
                       .getOrCreate()

# create the setup pipeline 
setupJob = PubmedPipelineSetup(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath)

# This downloads all the required papers from PubMed under the searchQueries
setupJob.downloadXmlFromPubmed(searchQueries, apiKey)

# This runs the pipeline and saves the classified papers in dataframeOutputPath
setupJob.runPipeline()
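Once the setup job has finished, the classified papers can be inspected with a standard Spark parquet read. This is a minimal sketch that reuses the sparkSession and dataframeOutputPath defined above:

# Load the dataframe written by the setup pipeline and take a quick look at it
papersDf = sparkSession.read.parquet(dataframeOutputPath)
papersDf.printSchema()
papersDf.show(5)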

Update Pipeline

The update pipeline class allows you to update your database of papers since the setup pipeline was run, or since the last update was run.

See below for how to use the update pipeline.

from pubmed_pipeline import PubmedPipelineUpdate
from pyspark.sql import SparkSession

XMLFilesDirectory = ""          # path to save downloaded XML content from PubMed
numSlices = ""                  # number of partitions the data will be parallelized into
searchQueries = [""]            # list of search query strings to run against PubMed
apiKey = ""                     # PubMed API key to allow an increased request rate and avoid HTTP 429 errors (see the E-utilities website for how to get a key)
lastRunDatePath = ""            # path containing a pickle object of the last run date (running the setup job creates one)
classifierPath = ""             # path to the classifier used to classify papers
dataframeOutputPath = ""        # path to store the final dataframe in parquet format
newAndUpdatedPapersPath = ""    # path to store the dataframe containing the new and updated papers

# your Spark session configuration
sparkSession = SparkSession.builder \
                       .master("local") \
                       .appName("") \
                       .config("spark.some.config.option", "some-value") \
                       .getOrCreate()

# create the update pipeline 
updateJob = PubmedPipelineUpdate(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath, newAndUpdatedPapersPath)

# This downloads all the required papers from PubMed under the searchQueries
updateJob.downloadXmlFromPubmed(searchQueries, apiKey)

# This runs the pipeline and saves the new and updated classified papers in newAndUpdatedPapersPath
# The pipeline also handles the logic to add new papers and remove any papers from the main dataframe which are no longer relevant
updateJob.runPipeline()
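After an update run, the new and updated papers can be inspected in the same way. Again a minimal sketch reusing the session and paths defined above:

# Load only the papers that were added or updated in this run
newPapersDf = sparkSession.read.parquet(newAndUpdatedPapersPath)
newPapersDf.show(5)

# The main dataframe at dataframeOutputPath has also been updated by the pipeline
mainDf = sparkSession.read.parquet(dataframeOutputPath)
print(mainDf.count())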

Customisation of library

If you wish to customise the library to meet your own needs, please fork the repository and do the following:

To customise the pipeline processes, change or add the functions in pubmedPipeline.py.

To customise the downloading of XML metadata, change setupPipeline.sh and updatePipeline.sh.

Core Developers

Nicolas Ford

Yalman Ahadi

Paul Lorthongpaisarn

Dependencies

We would like to acknowledge the projects and libraries that this pipeline builds on.
