This library allows for the easy creation of a machine learning pipeline that uses PubMed as its data source. Two kinds of pipelines can be made:
Setup pipeline: downloads all papers from PubMed matching a given search query, applies a machine learning classifier to them, and saves the output in parquet format.
Update pipeline: handles the logic of downloading all new and updated papers since the setup pipeline, or the last update pipeline, was run. The new and updated papers retrieved from PubMed are added to the main dataframe created by the setup pipeline and are also stored in a separate dataframe, which is written in parquet format.
Make sure you have Python and pip installed.
If you do not have git installed, follow these instructions to install it.
- Clone this repository (or alternatively download it directly from the GitHub page):
git clone https://github.com/nicford/Pubmed-Pipeline.git
- In your terminal, navigate into the cloned/downloaded folder. Run the following command to install the Pubmed Pipeline library:
pip install pubmed_pipeline
- Install the other required dependencies (example install commands are shown below the list):
Follow these instructions to install parallel.
Follow these instructions to install xmlstarlet.
Follow these instructions to install wget.
Follow these instructions to install curl.
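The linked instructions are platform-specific. As a rough sketch, on Debian/Ubuntu these dependencies can usually be installed with apt, and on macOS with Homebrew (package names are assumed to match the upstream projects):
sudo apt-get install parallel xmlstarlet wget curl   # Debian/Ubuntu (assumed package names)
brew install parallel xmlstarlet wget curl           # macOS with Homebrew (assumed package names)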
To create a pipeline object, you need to pass in a Spark session, so you must configure your Spark session beforehand. If you are unfamiliar with Spark sessions, you can get started here. Note: if you are using Databricks, a Spark session called "spark" is created automatically.
If you do not have your own PubMed XML data and you wish to download XML paper data from PubMed, you need a PubMed API key. This API key can be obtained by doing the following:
"Users can obtain an API key now from the Settings page of their NCBI account. To create an account, visit http://www.ncbi.nlm.nih.gov/account/."
The setup pipeline class allows you to create a setup pipeline. The example below shows how to use it.
from pubmed_pipeline import PubmedPipelineSetup
from pyspark.sql import SparkSession
XMLFilesDirectory = "" # path to save downloaded XML content from pubmed or path to XML data if you already have some
numSlices = "" # The numSlices denote the number of partitions the data would be parallelized to
searchQueries = [""] # list of strings for queries to search pubmed for
apiKey = "" # API key from pubmed to allow increased rate of requests, to avoid HTTP 429 error(see E-utilites website for how to get a key)
lastRunDatePath = "" # path to store a pickle object of the date when the setup is run (this is the same path to provide to the update job)
classifierPath = "" # path to the classifier used to classify papers
dataframeOutputPath = "" # path to store the final dataframe to in parquet form
# your Spark session configuration
sparkSession = SparkSession.builder \
.master("local") \
.appName("") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# create the setup pipeline
setupJob = PubmedPipelineSetup(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath)
# This downloads all the required papers from PubMed under the searchQueries
setupJob.downloadXmlFromPubmed(searchQueries, apiKey)
# This runs the pipeline and saves the classified papers in dataframeOutputPath
setupJob.runPipeline()
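Once the setup pipeline has finished, you can read the classified papers back from dataframeOutputPath with Spark to check the result. A minimal sketch, assuming the variables defined above:
# Read the classified papers saved by the setup pipeline
papersDf = sparkSession.read.parquet(dataframeOutputPath)
papersDf.printSchema()
print(papersDf.count())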
The update pipeline class allows you to update your database of papers with everything new or changed since the setup pipeline, or the last update, was run. The example below shows how to use it.
from pubmed_pipeline import PubmedPipelineUpdate
from pyspark.sql import SparkSession
XMLFilesDirectory = "" # path to save the XML content downloaded from PubMed
numSlices = "" # number of partitions the data will be parallelized into
searchQueries = [""] # list of query strings to search PubMed for
apiKey = "" # PubMed API key allowing an increased request rate, to avoid HTTP 429 errors (see the E-utilities website for how to get a key)
lastRunDatePath = "" # path containing a pickle object of the last run date (running the setup job creates one)
classifierPath = "" # path to the classifier used to classify papers
dataframeOutputPath = "" # path to store the final dataframe in parquet format
newAndUpdatedPapersPath = "" # path to store the dataframe containing the new and updated papers
# your Spark session configuration
sparkSession = SparkSession.builder \
.master("local") \
.appName("") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# create the update pipeline
updateJob = PubmedPipelineUpdate(sparkSession, XMLFilesDirectory, classifierPath, dataframeOutputPath, numSlices, lastRunDatePath, newAndUpdatedPapersPath)
# This downloads all the required papers from PubMed under the searchQueries
updateJob.downloadXmlFromPubmed(searchQueries, apiKey)
# This runs the pipeline and saves the new and updated classified papers in newAndUpdatedPapersPath
# The pipeline also handles the logic to add new papers and remove any papers from the main dataframe which are no longer relevant
updateJob.runPipeline()
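After an update run, you can compare the refreshed main dataframe with the dataframe of new and updated papers. A minimal sketch, assuming the variables defined above:
# Main dataframe, now including the new and updated papers
mainDf = sparkSession.read.parquet(dataframeOutputPath)
# Only the papers added or updated by this run
newPapersDf = sparkSession.read.parquet(newAndUpdatedPapersPath)
print(mainDf.count(), newPapersDf.count())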
If you wish to customise the library to meet your own needs, please fork the repository and do the following:
- To customise the pipeline processes, change/add the functions in pubmedPipeline.py.
- To customise the downloading of XML metadata, change setupPipeline.sh and updatePipeline.sh.
We would like to acknowledge the following projects:
and the following libraries: