This Repository contains the complete software and documentation to execute the Long-Read-Proteogenomics Workflow.
Sequence Read Archive (SRA) Project Reference | Description |
---|---|
PRJNA783347 | Long-Read RNA Sequencing Project for Jurkat Samples |
PRJNA193719 | Short-Read RNA Sequencing Project for Jurkat Samples |
Updated: 2022 January 30
This is the repository for the Long-Read Proteogenomics workflow. Written in Nextflow, it is a modular workflow beneficial to both the transcriptomics and proteomics fields. The workflow uses data from both long-read Iso-Seq sequencing with PacBio and mass spectrometry-based proteomics for the classification and analysis of protein isoforms expressed in Jurkat cells, as described in the publication Enhanced protein isoform characterization through long-read proteogenomics, which will be made public in Fall 2022.
The output data produced by running this workflow for the manuscript Enhanced Protein Isoform Characterization through Long Read Proteogenomics may be found here [insert Zenodo Reference here]. The analysis that produced the figures for the manuscript may be found in the companion repository Long-Read Proteogenomics Analysis.
A goal in the biomedical field is to delineate the protein isoforms that are expressed and have pathophysiological relevance. Towards this end, new approaches are needed to detect protein isoforms in clinical samples. Mass spectrometry (MS) is the main methodology for protein detection; however, poor coverage and incompleteness of protein databases limit its utility for isoform-resolved analysis. Fortunately, long-read RNA-seq approaches from PacBio and Oxford Nanopore platforms offer opportunities to leverage full-length transcript data for proteomics.
We introduce enhanced protein isoform detection through integrative “long read proteogenomics”. The core idea is to leverage long-read RNA-seq to generate a sample-specific database of full-length protein isoforms. We show that incorporation of long read data directly in the MS protein inference algorithms enables detection of hundreds of protein isoforms intractable to traditional MS. We also discover novel peptides that confirm translation of transcripts with retained introns and novel exons. Our pipeline is available as an open-source Nextflow pipeline, and every component of the work is publicly available and immediately extendable.
Proteogenomics is providing new insights into cancer and other diseases. The proteogenomics field will continue to grow, and, paired with increases in long-read sequencing adoption, we envision use of customized proteomics workflows tailored to individual patients.
We acknowledge that the beginning kernels of this work were formed during the Fall of 2020 at the Cold Spring Harbor Laboratory Biological Data Science Codeathon.
We acknowledge Lifebit and the use of their platform, Lifebit CloudOS, which was key in the development of the open-source Nextflow workflow used in this work.
This workflow is complex, bringing together two measurement technologies in a long-read proteogenomics approach that integrates sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. To orient the user to the steps involved in transforming raw measurement data into fully resolved, identified and annotated results, we have developed this quick start and wiki documentation, including vignettes.
This repository is organized into modules, and parts of it could be useful to different researchers for annotating their own raw data. The workflow is written in Nextflow, allowing it to be run on virtually any platform with alterations to the configurations and other adaptations. The visitor is encouraged to fork, clone, adapt and contribute. All are encouraged to use GitHub Issues to communicate with the contributors to this open-source software project. Software additions, modifications and contributions are made through GitHub Pull Requests.
Details of the module processes are documented in the Wiki within this repository, which also links to the third-party resources used in this workflow.
Vignettes have been developed to go into greater detail, walking the visitor through the visualization capabilities of the final annotated results and through the workflow presented here in the quick start.
This quick start and its steps were performed on a MacBook Pro running macOS Big Sur Version 11.4 with 16 GB of 2667 MHz DDR4 RAM and a 2.3 GHz 8-Core Intel Core i9 processor.
The visitor will be walked through the prerequisites, cloning the repository, and executing the workflow with the demonstration data also used in the GitHub Actions.
In this quick start, the Docker Desktop application for the Mac with an Intel chip was used. Follow the instructions provided there to install it.
On the MacBook Pro running macOS Big Sur Version 11.4 with 16 GB of RAM, it was necessary to configure the Docker resources to use 6 GB of RAM.
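As a quick sanity check on the memory setting, the following minimal sketch asks the Docker daemon how much memory it can see and prints it in MiB. It assumes the `docker` command-line client is on the PATH; `bytes_to_mib` is a helper name introduced here for illustration and is not part of the repository. The reported figure may be somewhat lower than the value configured in Docker Desktop.

```shell
# Convert a byte count to whole mebibytes (integer division).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Total memory visible to the Docker daemon, in bytes; empty if docker is unavailable.
docker_mem=$(docker info --format '{{.MemTotal}}' 2>/dev/null || true)

# Print the value only when a positive integer was returned.
if [ "$docker_mem" -gt 0 ] 2>/dev/null; then
    echo "Docker reports $(bytes_to_mib "$docker_mem") MiB of memory"
else
    echo "Could not read memory from Docker; is Docker Desktop running?"
fi
```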
On the MacBook Pro, the 64-bit version of Miniconda was downloaded and installed by following the installation instructions.
To begin, open a terminal window. Once the Miniconda installation has completed, restart the terminal shell. On the Mac, this is done within a zsh shell environment.
exec -l zsh
If you already have the environment, you can see what conda environments you have with the following command:
conda info --envs
If you haven't already created a conda environment for this work, create and activate it now.
conda create -n lrp
conda activate lrp
Install and set the Nextflow version.
conda install -c bioconda nextflow -y
export NXF_VER=20.01.0
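Before cloning, it can help to confirm that the tools used in this quick start are reachable from the current shell. The sketch below is one way to do that; `check_tools` is a hypothetical helper written for this illustration, not something shipped with the workflow, and the list of tools is taken from the steps described here.

```shell
# Report whether each named command is available on PATH.
# Returns 0 only if every command was found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools this quick start relies on; print a hint if any are missing.
check_tools docker conda nextflow git || echo "Install the missing tools before continuing."
```

If `nextflow` is reported as missing, check that the `lrp` conda environment is active in the current shell.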
Now with the environment ready, we can clone.
git clone https://.com/sheynkman-lab/Long-Read-Proteogenomics
cd Long-Read-Proteogenomics
This quick start uses the test_without_sqanti.config configuration file found in the conf directory of this repository.
nextflow run main.nf --config conf/test_without_sqanti.config
For details regarding the processes and results produced, please see the Wiki and the Vignette: Workflow with test data.
To visualize results, please see the visualization capabilities of the final annotated results.
Details about each of the processes that make up the sheynkman-lab/Long-Read-Proteogenomics pipeline are found in the Wiki. There you will find:
- Third-party tools
- Input parameters
- Output files
- Pipeline processes descriptions
- Vignette: Visualization
- Vignette: Workflow with test data
The workflow accepts as input raw PacBio data and performs the assembly of predicted protein isoforms with high probability of existing in the sample. This database is then used in MetaMorpheus to search raw mass spectrometry data against the PacBio reference. MetaMorpheus will use protein isoform read counts during protein inference. Two other protein databases are employed for the purposes of comparison. One is from UniProt and the other is from GENCODE. A series of Jupyter notebooks can be used to perform all final comparisons and data analysis.
To make the data more accessible and FAIR, the indexed files were transferred to Zenodo using zenodo-upload from the University of Virginia's Gloria Sheynkman Lab Amazon S3 buckets.
Using Nextflow, configuration items can access locations in Google Cloud Platform (GCP) buckets (gs://), Amazon Web Services (AWS) buckets (s3://) and Zenodo locations (https://) seamlessly.
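To make the three kinds of location concrete, here is a small sketch that classifies a path by its URI scheme. It does not reproduce how Nextflow resolves files internally; `storage_backend` is a name introduced here for illustration, and the bucket names and record number below are placeholders, not real data locations.

```shell
# Map a data location to the storage backend implied by its URI scheme.
storage_backend() {
    case "$1" in
        gs://*)    echo "gcp" ;;
        s3://*)    echo "aws" ;;
        https://*) echo "https" ;;
        *)         echo "local" ;;
    esac
}

# Hypothetical locations, for illustration only.
for location in \
    "gs://example-bucket/jurkat.fasta" \
    "s3://example-bucket/jurkat.fasta" \
    "https://zenodo.org/record/0000000/files/jurkat.fasta" \
    "data/jurkat.fasta"
do
    echo "$(storage_backend "$location")  $location"
done
```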
The main reasons for choosing Zenodo over AWS S3 or GCP GS buckets are:
- Data versioning (of primary importance): In S3 or GS buckets, data can be overwritten at the same path at any point, possibly breaking the pipeline.
- Cost: These datasets are tiny, but the principle stands: the less storage, the better.
- Access: Most users of the pipeline can most easily access Zenodo and will be able to use the data. AWS and GCP have an entry barrier.
Details on how these data were transferred from AWS S3 buckets are described in AWS to Zenodo.
- Christina Chatzipantsiou
- Benjamin Jordan
- Simran Kaur
- Raymond Leclair
- Anne Deslattes Mays
- Madison Mehlferber
- Rachel M. Miller
- Robert J. Millikin
- Kyndalanne Pike
- Gloria M. Sheynkman
- Michael R. Shortreed
- Isabella Whitworth
This is a joint project between the Sheynkman Lab, the Smith Lab, Lifebit and Science and Technology Consulting, LLC.
This pipeline was generated using a modification of the nf-core template.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.