treko90/tarsus

Tarsus (TArgeted SUbstitution Search)

This is the overview of the pipeline described in Tretyachenko, V., Leiman, T., Dahan, O., Asraf, O., Dahary, D., & Pilpel, Y. T. (2025). Encoded and non-genetic alternative protein variants expand human functional proteome. bioRxiv, 2025-02. https://www.biorxiv.org/content/10.1101/2025.02.17.638604v1

The pipeline processes raw mass spectrometry data with three open-source tools and outputs a validated list of single amino acid substitutions (SAAVs) relative to the reference database. The validation step is computationally intensive and requires parallel processing on a computational cluster. The scripts in this repository are adapted to our cluster environment and will need custom modifications for your system. They are deliberately simple and serve as guidance for adapting the pipeline to other environments. Their purpose is to illustrate how the open-source software packages can be combined into a unified process.

Software requirements:

FragPipe v23.1 - Available from https://github.com/Nesvilab/FragPipe

PepQuery 2.0.2 - Available from http://pepquery.org/download.html

PDV 2.2.0 - Available from https://github.com/wenbostar/PDV

Python 3.11.5 with polars and pandas packages installed

How to run the pipeline:

  1. convert your raw MS files to mzML format either using msconvert or ThermoRawFileParser.

  2. create a PepQuery index from the mzML files. 'make_index.sh path_to_mzML output_path' will create an index from the mzML files in path_to_mzML and put it in the pepquery_index directory inside output_path. You should define the location of PepQuery within the script. For PepQuery indexing reference see http://pepquery.org/document.html#index

  3. run FragPipe's two-pass search. The first pass finds reference database PSMs, subtracts these spectra from the input mzML and creates remainder sub.mzML files containing only unidentified spectra. 'search_pipeline.sh fasta_file output_path path_to_mzML':

    • adds decoys to the searched fasta file using Philosopher, which is included in the FragPipe suite
    • adds the path to this new decoy-containing database to the generic.workflow file required for the first search
    • runs the first search using this workflow (modify generic.workflow if necessary)
    • takes all detected peptides from peptides.tsv and in silico mutagenizes them to create a second search database
    • in silico mutagenesis is performed by the make_peptides.py script. You can control which substitutions are excluded from the mutagenesis by modifying subs_exclude.csv. Columns are origin amino acids, rows are destination amino acids. A True value excludes the origin->destination mutation; a False value includes it in the library
    • runs Philosopher again to add decoys to this new database
    • modifies the fragpipe-second-pass.workflow file to include this new database, sets the nocleavage parameter in MSFragger so that peptides in the database are searched as is without being cleaved, and sets the minisotopes and minscans parameters in IonQuant to 1 because substitutions are usually rare. Modify these parameters if necessary. Set the paths to your Philosopher and FragPipe directories. For FragPipe's two-stage search reference see https://fragpipe.nesvilab.org/docs/tutorial_two_pass_search.html
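
The in silico mutagenesis idea can be pictured with a minimal sketch. This is not the code of make_peptides.py: the function name `mutagenize` and the in-memory `excluded` set (standing in for the True cells of subs_exclude.csv) are assumptions made for illustration only.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutagenize(peptide, excluded=frozenset()):
    """Return every single-substitution variant of `peptide`.

    `excluded` holds (origin, destination) pairs to skip, mimicking the
    True cells of an exclusion table (hypothetical in-memory form).
    """
    variants = set()
    for pos, origin in enumerate(peptide):
        for dest in AMINO_ACIDS:
            if dest == origin or (origin, dest) in excluded:
                continue
            variants.add(peptide[:pos] + dest + peptide[pos + 1:])
    return sorted(variants)


# Example: skip the isobaric I->L and L->I substitutions.
excluded = {("I", "L"), ("L", "I")}
variants = mutagenize("PEPTIDE", excluded)
```

Each peptide of length n yields at most 19 x n variants, so the second search database grows quickly; the exclusion table is the natural place to keep it manageable.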
  4. Extract detected peptide sequences and corresponding spectrum titles from the psm.tsv file generated by the second search. 'python extract_psm.tsv second_search_output_path' will create the psm_input4pepquery.tsv file with the peptide-spectrum_title rows required for the PepQuery PSM validation search. Courtesy of Wen Bo.

  5. run PepQuery PSM validation of each PSM detected by the second search. This step requires a computational cluster as thousands of PSMs need to be validated in parallel. 'query_psm.sh first_search_output_path fasta_path' will:

    • split the psm_input4pepquery.tsv into smaller 10-row chunks
    • submit each chunk as a separate cluster job. Modify the job submission according to your cluster parameters, and modify the PepQuery parameters to fit your MS search parameters. By default PepQuery performs the more stringent search (-hc flag) and includes competition against all Unimod modifications and reference database peptide substitutions (-aa flag); fast search is not activated. fasta_path MUST contain a fasta file named protein.faa or protein.fasta. For PepQuery parameters reference see http://pepquery.org/document.html#saparameter
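
The chunking step can be sketched in Python as follows. This is an illustration, not the code of query_psm.sh: the function name `split_tsv`, the chunk file naming and the assumption that the input TSV has no header row are all hypothetical.

```python
import csv
from pathlib import Path


def split_tsv(input_path, out_dir, chunk_size=10):
    """Split a TSV file into files of at most `chunk_size` rows.

    Assumes the input has no header row (an assumption of this sketch).
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(input_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    chunk_paths = []
    for start in range(0, len(rows), chunk_size):
        chunk_path = out_dir / f"chunk_{start // chunk_size:05d}.tsv"
        with open(chunk_path, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerows(rows[start:start + chunk_size])
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Each returned path would then be handed to one cluster job; how jobs are submitted depends entirely on your scheduler.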
  6. aggregate outputs from all separate PepQuery searches (psm_rank.txt, ptm_detail.txt and psm_rank.mgf) into the psmrankall.txt, ptmall.txt and mgfall.txt files. 'aggregate_query.sh path_to_pepquery_result_dir' will create all aggregated outputs and run PDV on the psmrankall and mgfall files to extract the fragment ion series of each peptide tested by PepQuery. For PepQuery output reference see http://pepquery.org/document.html#saoutput
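
A minimal sketch of the table aggregation is shown below. It is not the code of aggregate_query.sh: the function name `aggregate_tables` is hypothetical, and the sketch assumes that every per-job table starts with the same single header line.

```python
from pathlib import Path


def aggregate_tables(result_dir, file_name, out_path):
    """Concatenate every `file_name` found under `result_dir`.

    Assumes each file starts with one header line (an assumption of this
    sketch); the header is written once. Returns the number of data rows.
    """
    header_written = False
    n_rows = 0
    # sorted() materialises the file list before the output is opened.
    paths = sorted(Path(result_dir).rglob(file_name))
    with open(out_path, "w") as out:
        for path in paths:
            lines = path.read_text().splitlines()
            if not lines:
                continue
            if not header_written:
                out.write(lines[0] + "\n")
                header_written = True
            for line in lines[1:]:
                if line.strip():
                    out.write(line + "\n")
                    n_rows += 1
    return n_rows
```

For example, `aggregate_tables(result_dir, "psm_rank.txt", "psmrankall.txt")` would build one combined table. MGF files carry no header, so they can simply be concatenated as they are.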

  7. analyze the outputs and produce the substitution list. 'python analyze_outputs.py first_search_output_path/' checks each tested peptide and the PepQuery scoring of every competitive match in ptmall.txt. It keeps only PSMs associated with amino acid substitutions and selects PSMs with the maximum score and a single suggested sequence variant (no alternative peptide variants). Additionally, it filters out PSMs where the substituted position is covered by only one fragment ion, and all substitutions found at the N-terminus of the peptide, due to decreased confidence in such detections.
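
The filtering logic of this step can be sketched on plain dictionaries. This is not the code of analyze_outputs.py: the record keys (`spectrum`, `kind`, `position`, `score`, `covering_ions`), the function name `select_confident` and the reading of "N-terminus" as position 1 are assumptions made for illustration.

```python
from collections import defaultdict


def select_confident(matches, min_ions=2):
    """Keep one unambiguous, well-supported substitution per spectrum.

    `matches` is a list of dicts with hypothetical keys: spectrum, kind,
    position (1-based substituted residue), score, covering_ions.
    """
    by_spectrum = defaultdict(list)
    for match in matches:
        by_spectrum[match["spectrum"]].append(match)

    kept = []
    for candidates in by_spectrum.values():
        best_score = max(c["score"] for c in candidates)
        best = [c for c in candidates if c["score"] == best_score]
        if len(best) != 1:
            continue  # several variants share the top score: ambiguous
        top = best[0]
        if top["kind"] != "substitution":
            continue  # top match is explained by something else
        if top["position"] == 1:
            continue  # N-terminal substitution: lower confidence
        if top["covering_ions"] < min_ions:
            continue  # substituted position supported by too few ions
        kept.append(top)
    return kept
```

The order of the checks does not change the result; they are listed in the order the step describes them.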

  8. prepare input for rescoring the identified alternative peptide-spectrum matches with PepQuery against their reference peptide counterparts. 'python ref_validation_prep.py first_search_output_path/' loads the PepQuery-generated MGF file with all matched spectra and creates a 'fake' MGF file in which the PEPMASS values correspond to the m/z of the reference peptides rather than the alternative peptides. This is needed because PepQuery would not consider an MS2 spectrum for matching with the reference peptide if the precursor mass does not correspond to it. Next, the pepquery_input_base directory is created with single-line inputs for validation. Each input contains the spectrum title from the new 'fake' MGF file and the corresponding reference peptide sequence to test it against. CAM on cysteine is considered a variable modification this time.
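
The PEPMASS rewriting can be illustrated with a short sketch. It is not the code of ref_validation_prep.py: the function names `peptide_mz` and `rewrite_pepmass` are hypothetical, the masses are standard monoisotopic values, and modifications (including CAM on cysteine) are ignored for brevity.

```python
# Standard monoisotopic residue masses in Da (unmodified residues only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mz(sequence, charge):
    """Monoisotopic m/z of an unmodified peptide at the given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def rewrite_pepmass(mgf_text, new_mz_by_title):
    """Replace PEPMASS in every MGF block whose TITLE is in the mapping.

    The optional intensity after the m/z value is dropped in this sketch.
    """
    out, block = [], []
    for line in mgf_text.splitlines():
        block.append(line)
        if line.strip() == "END IONS":
            title = next((l[len("TITLE="):] for l in block
                          if l.startswith("TITLE=")), None)
            if title in new_mz_by_title:
                new_line = f"PEPMASS={new_mz_by_title[title]:.5f}"
                block = [new_line if l.startswith("PEPMASS=") else l
                         for l in block]
            out.extend(block)
            block = []
    out.extend(block)  # keep any trailing lines outside a block
    return "\n".join(out) + "\n"
```

The charge passed to `peptide_mz` should match the CHARGE recorded for the spectrum, otherwise the rewritten precursor m/z would still not correspond to the reference peptide.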

  9. run the PepQuery rescoring of identified spectra against reference peptides. 'query_psm_base.sh first_search_output_path fasta_path' runs PepQuery validation for the reference (base) peptides using the 'fake' MGF as the spectral database.

  10. aggregate the results of the rescoring search. 'aggregate_query_base.sh path_to_pepquery_result_recheck_base_dir' collects all PepQuery outputs and aggregates them into files similar to those in step 6, this time for the reference peptide validation.

  11. localize substitutions and report only confident hits. 'python localize.py first_search_output_path' filters out alternative peptide PSMs with a lower hyperscore than the reference proteome PSMs, performs localization of the mass shift using pyAscore (see the imported modules for the packages that need to be installed) and outputs subs_localized.ipc with the final list of validated substitutions.
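
The hyperscore comparison in this step can be sketched as below. It is not the code of localize.py: the function name `keep_better_than_reference`, the use of plain dictionaries keyed by spectrum title, and the handling of ties and of spectra without a reference match (both kept) are assumptions of this illustration. The pyAscore localization itself is not shown.

```python
def keep_better_than_reference(alt_scores, ref_scores):
    """Drop alternative PSMs that score lower than the reference PSM.

    Both arguments map spectrum title -> hyperscore. Ties and spectra
    with no reference match are kept (an assumption of this sketch).
    """
    kept = {}
    for title, alt in alt_scores.items():
        ref = ref_scores.get(title)
        if ref is None or alt >= ref:
            kept[title] = alt
    return kept


# Example with made-up scores for three spectra.
alt = {"scan1": 32.5, "scan2": 18.0, "scan3": 27.1}
ref = {"scan1": 20.0, "scan2": 25.4}
confident = keep_better_than_reference(alt, ref)
```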
