pep2gene

Match peptide search results from TPP or MSPLIT to genes.

Motivation
Rules
Installation
Usage
Output

Motivation

Matching peptides to the correct protein and protein isoform can be a challenging task requiring complicated rulesets that can create a confusing picture of the actual sample composition. If the principal interest is actually at the gene level, one way to simplify the interpretation of results is to match peptides to genes, since much of the complexity at the protein level is due to different transcripts arising from a single gene sequence. pep2gene was created to perform this task of matching peptides to genes.

Rules

Peptides are matched via protein sequence to the corresponding gene identifier. A peptide that matches to multiple proteins that arise from a single gene will count as a single unique peptide match to the gene and any spectral counts for that peptide will be assigned to the gene. If a peptide matches to more than a single gene, whether the peptide gets assigned to the gene, and what portion of its spectral counts get assigned depend on the following rules.

1. If a peptide matches to multiple genes (shared peptide), and there is evidence that each of those genes is present in the sample, i.e. each gene has at least one unique peptide, then the shared peptide will be assigned to all of these matching genes and each gene will get a portion of the shared peptide's spectral counts relative to the evidence for each gene's existence. For example, if one gene has two unique peptides and another has one unique peptide, the first will get twice as many spectral counts from the shared peptide.
2. If a peptide matches to multiple genes but only a subset of those genes have unique peptides, the shared peptide will only be assigned to those genes for which there is definite evidence they are in the sample, i.e. those with at least one unique peptide. The spectral counts for the shared peptide will be apportioned as in rule 1.
3. If a gene (A) matches to a peptide or peptides, but those same peptides also match to another gene (B) and that other gene has additional evidence for its existence, gene A is considered to be subsumed by B and will be listed as such in the output summary for gene B.
4. If two (or more) genes match to the exact same set of peptides and there is no evidence favouring the presence of one gene over the other, both genes are considered to be present and will evenly split the spectral counts of the shared peptides.
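
The apportioning described in the rules above can be illustrated with a minimal sketch. This is not pep2gene's actual code: the function name apportion and its inputs are hypothetical, and the even split when no gene has a unique peptide is a simplification of the case where genes match the exact same set of peptides.

```go
package main

import "fmt"

// apportion splits a shared peptide's spectral count between genes in
// proportion to each gene's number of unique peptides. Genes without unique
// peptides receive no share. If no gene has a unique peptide, the count is
// split evenly. Hypothetical helper, for illustration only.
func apportion(total float64, uniqueByGene map[string]int) map[string]float64 {
	shares := make(map[string]float64)
	if len(uniqueByGene) == 0 {
		return shares
	}
	sum := 0
	for _, n := range uniqueByGene {
		sum += n
	}
	if sum == 0 {
		// No gene is favoured by unique evidence: split evenly.
		for gene := range uniqueByGene {
			shares[gene] = total / float64(len(uniqueByGene))
		}
		return shares
	}
	for gene, n := range uniqueByGene {
		if n > 0 {
			shares[gene] = total * float64(n) / float64(sum)
		}
	}
	return shares
}

func main() {
	// Gene A has two unique peptides, gene B has one, gene C has none.
	shares := apportion(6, map[string]int{"A": 2, "B": 1, "C": 0})
	fmt.Println(shares["A"], shares["B"], shares["C"]) // prints "4 2 0"
}
```

Gene A receives twice the share of gene B, matching the example in rule 1, and gene C receives nothing because it has no unique peptide.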

Installation

This was built as a Go module using go1.12.7. If you have Go installed, you can build and install the binary. Otherwise, it can be run as a container, for example with Docker.

Go executable

Ensure Go is installed.
Clone repo

git clone https://github.com/gingraslab/pep2gene.git
cd pep2gene

Build executable

go build

The executable will be called pep2gene.

Docker

Pull image (and rename - optional)

docker pull ghcr.io/gingraslab/pep2gene:v1.4.0
docker tag ghcr.io/gingraslab/pep2gene:v1.4.0 pep2gene

Check for versions.

Build

Clone repo

git clone https://github.com/gingraslab/pep2gene.git
cd pep2gene

Build the image

For Docker:

docker build -t pep2gene -f docker/standard/Dockerfile .

For Singularity:

docker build -t pep2genesing -f docker/singularity/Dockerfile .

We do not provide a Singularity definition file, but the Docker image can be used with Singularity provided it is built from the correct source. The Dockerfile found in docker/standard/ was designed for Docker itself. While the image is small (~7 MB), it does not work with Singularity. The Dockerfile found in docker/singularity/ will build an image compatible with Singularity, although it is about twice the size (13 MB).

The images are also hosted at GitHub in separate repos: Docker and Singularity.

Usage

Go executable

pep2gene -db="database.fasta" -file="sample.pepxml" -enzyme="trypsin"

Docker

docker run -v $(pwd):/files/ pep2gene -db="database.fasta" -file="sample.pepxml" -enzyme="trypsin"

Singularity

singularity run -B ./:/files/ docker://ghcr.io/gingraslab/pep2genesing:v1.4.0 -db="database.fasta" -file="sample.pepxml" -enzyme="trypsin"

The database and peptide file must be located in the working directory Docker/Singularity is called from. Relative or nested paths, e.g. ./some-directory/database.fasta or ../database.fasta, will not work. The output file will also be written to the working directory.

Flags

General

Name	Description	Required	Default
-db	FASTA database	true
-enzyme	digestion enzyme	false
-file	peptide file	true
-ignoreinvalid	ignore sequences with an invalid header	false	true
-inferenzyme	infer the digestive enzyme	false	false
-missedcleavages	number of missed cleavages	false	0
-output	output file format	false	json
-pipeline	search pipeline	false	TPP

MSPLIT

Name	Description	Required	Default
-fdr	MSPLIT peptide FDR	false	0.01

OpenSWATH

Name	Description	Required	Default
-ignoreDecoys	ignore decoy peptides	false	true
-mscore	m_score for filtering	false	0.05
-mscorepeptideexperimentwide	m_score_peptide_experiment_wide	false	0.01
-peakgrouprank	peak_group_rank for filtering	false	1

TPP

Name	Description	Required	Default
-pepprob	TPP peptide probability	false	0.85

Notes

-db (database)

The search database is expected to be in FASTA format, with headers containing a string of the following form:

gn|<gene symbol>:<Entrez gene ID>

For example:

>gi|22538794|gn|PDCD10:11235| programmed cell death protein 10 [Homo sapiens]
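
As a rough illustration of the header format, the sketch below extracts the gene symbol and Entrez gene ID with a regular expression. The pattern and the parseGene function are assumptions made for this example (a symbol without '|' or ':' and a numeric ID). They are not taken from pep2gene's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// geneTag matches "gn|<gene symbol>:<Entrez gene ID>" anywhere in a header.
// Assumption: the symbol contains no '|' or ':' and the ID is numeric.
var geneTag = regexp.MustCompile(`gn\|([^|:]+):(\d+)`)

// parseGene returns the symbol and ID, or ok=false when the header does not
// contain a conforming gene tag. Hypothetical helper, for illustration only.
func parseGene(header string) (symbol, id string, ok bool) {
	m := geneTag.FindStringSubmatch(header)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	header := ">gi|22538794|gn|PDCD10:11235| programmed cell death protein 10 [Homo sapiens]"
	symbol, id, ok := parseGene(header)
	fmt.Println(symbol, id, ok) // prints "PDCD10 11235 true"
}
```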

-enzyme

If an enzyme is specified, the sequence database will be digested before peptide matching begins. This significantly speeds up the matching process. If no enzyme is used, peptides are matched against any protein subsequence.

The available enzymes are:

arg-c
asp-n
asp-n_ambic
chymotrypsin
cnbr
lys-c
lys-c/p
lys-n
pepsina
trypchymo
trypsin
trypsin/p
v8-de
v8-e

-fdr

The FDR is used for parsing high-quality peptides from MSPLIT results, both DDA and DIA. It is ignored when parsing TPP results.

-file

pepXML files from TPP are supported, as are DDA and DIA output files from MSPLIT and DIA files from OpenSWATH.

-ignoredecoys

Ignore decoy peptides. Currently only implemented for OpenSWATH results.

-ignoreinvalid

Sequences that do not conform to the required header format

gn|<gene symbol>:<Entrez gene ID>

will be ignored by default since pep2gene will not know how to parse the gene symbol and gene ID, both of which are required. This can be overridden by setting this argument to false. When this argument is set to false, any sequences for which a symbol and ID cannot be determined will be identified by any leading non-whitespace characters in the header, and will be prefixed with p- to indicate they do not conform.

-inferenzyme

pep2gene can infer the enzyme used to digest the sample, rather than requiring it to be input as an argument. However, currently the enzyme name can only be parsed from pepXML files that contain the sample_enzyme field:

<sample_enzyme name="trypsin">

The name of the enzyme must match one of the names listed above.

-mscore

m_score for filtering OpenSWATH results. Peptides with an m_score less than or equal to this value will be used.

-mscorepeptideexperimentwide

m_score_peptide_experiment_wide for filtering OpenSWATH results. Peptides with an m_score_peptide_experiment_wide less than or equal to this value will be used.

-output

Results can be output in either json (default) or txt format. The txt format is a legacy format that we do not recommend using. See the Output section for a detailed description of each format.

-peakgrouprank

Peak group rank (peak_group_rank) to filter OpenSWATH results by. The default is 1, so only peptides with that value will be used. A value of 2 would use peptides with a value of either 1 or 2.
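
The OpenSWATH thresholds (-mscore, -mscorepeptideexperimentwide, -peakgrouprank and the decoy flag) can be pictured as a single row filter. The sketch below assumes the filters are combined with a logical AND, which this README does not state explicitly. The row and filter types are hypothetical.

```go
package main

import "fmt"

// row holds the OpenSWATH columns relevant to filtering.
// The type and field names are hypothetical.
type row struct {
	Peptide                     string
	Decoy                       bool
	MScore                      float64
	MScorePeptideExperimentWide float64
	PeakGroupRank               int
}

// filter mirrors the OpenSWATH flags. Hypothetical type.
type filter struct {
	IgnoreDecoys                   bool
	MaxMScore                      float64
	MaxMScorePeptideExperimentWide float64
	MaxPeakGroupRank               int
}

// keep reports whether a row passes every threshold (assumed AND logic).
func (f filter) keep(r row) bool {
	if f.IgnoreDecoys && r.Decoy {
		return false
	}
	return r.MScore <= f.MaxMScore &&
		r.MScorePeptideExperimentWide <= f.MaxMScorePeptideExperimentWide &&
		r.PeakGroupRank <= f.MaxPeakGroupRank
}

func main() {
	// Default values taken from the flag tables above.
	f := filter{IgnoreDecoys: true, MaxMScore: 0.05, MaxMScorePeptideExperimentWide: 0.01, MaxPeakGroupRank: 1}
	rows := []row{
		{Peptide: "DQVIYPDGR", MScore: 0.01, MScorePeptideExperimentWide: 0.005, PeakGroupRank: 1},
		{Peptide: "ITELMQVLK", MScore: 0.01, MScorePeptideExperimentWide: 0.005, PeakGroupRank: 2},
		{Peptide: "AGFAGDDAPR", Decoy: true, MScore: 0.01, MScorePeptideExperimentWide: 0.005, PeakGroupRank: 1},
	}
	for _, r := range rows {
		if f.keep(r) {
			fmt.Println(r.Peptide) // only DQVIYPDGR passes
		}
	}
}
```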

-pepprob

The peptide probability for parsing high-quality peptides from TPP results. It is ignored when parsing MSPLIT results.

-pipeline

The analysis pipeline used for searching peptides. The options are:

MSPLIT_DDA
MSPLIT_DIA
OPENSWATH
TPP

Output

json

The json format will contain fields for user-supplied command line arguments, for example the database and file names, and a genes object indexed by gene ID for each gene identified in the sample.

gene fields

gene field	definition
name	gene name/symbol
peptides	peptides assigned to the gene
sharedIDs	any other genes (by ID) it shares peptides with
sharedNames	any other genes (by name) it shares peptides with
spectralCount	total spectral count for the gene
subsumed	subsumed genes
unique	peptides unique to the gene
uniqueShared	peptides unique to the gene group, if the gene shares peptides

peptide fields

peptide field	definition
allottedSpectralCount	the portion of the peptide's spectral count allotted to the gene
totalSpectralCount	the total spectral count for the peptide in the sample
unique	a boolean indicating if the peptide is unique to the gene
uniqueShared	a boolean indicating if the peptide is unique to the group the gene shares peptides with

{
  "database": "database.fasta",
  "enzyme": "trypsin",
  "file": "sample.pepxml",
  "genes": {
    "5825": {
      "name": "ABCD3",
      "peptides": {
        "DQVIYPDGR": {
          "allottedSpectralCount": 1,
          "totalSpectralCount": 1,
          "unique": true,
          "uniqueShared": false
        },
        "FDHVPLATPN[115]GDVLIR": {
          "allottedSpectralCount": 1,
          "totalSpectralCount": 1,
          "unique": true,
          "uniqueShared": false
        }
      },
      "sharedIDs": [],
      "sharedNames": [],
      "spectralCount": 2,
      "subsumed": [],
      "unique": 2,
      "uniqueShared": 0
    },
    "60": {
      "name": "ACTB",
      "peptides": {
        "AGFAGDDAPR": {
          "allottedSpectralCount": 2.5,
          "totalSpectralCount": 5,
          "unique": false,
          "uniqueShared": true
        },
        "DLTDYLMK": {
          "allottedSpectralCount": 2.5,
          "totalSpectralCount": 5,
          "unique": false,
          "uniqueShared": false
        }
      },
      "sharedIDs": ["71"],
      "sharedNames": ["ACTG1"],
      "spectralCount": 5,
      "subsumed": ["100996820", "345651", "445582", "653269", "653781", "728378"],
      "unique": 0,
      "uniqueShared": 1
    }
  }
}
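
For downstream processing, the json output can be decoded into Go structs. The sketch below is based only on the fields shown in the example above. The type names (Result, Gene, Peptide) and the function parseResult are hypothetical, and the choice of float64 for spectral counts is an assumption made because allotted counts can be fractional.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Peptide mirrors the peptide fields in the example output.
type Peptide struct {
	AllottedSpectralCount float64 `json:"allottedSpectralCount"`
	TotalSpectralCount    float64 `json:"totalSpectralCount"`
	Unique                bool    `json:"unique"`
	UniqueShared          bool    `json:"uniqueShared"`
}

// Gene mirrors the gene fields in the example output.
type Gene struct {
	Name          string             `json:"name"`
	Peptides      map[string]Peptide `json:"peptides"`
	SharedIDs     []string           `json:"sharedIDs"`
	SharedNames   []string           `json:"sharedNames"`
	SpectralCount float64            `json:"spectralCount"`
	Subsumed      []string           `json:"subsumed"`
	Unique        int                `json:"unique"`
	UniqueShared  int                `json:"uniqueShared"`
}

// Result holds the top-level fields seen in the example; other
// command line arguments present in real output are ignored here.
type Result struct {
	Database string          `json:"database"`
	Enzyme   string          `json:"enzyme"`
	File     string          `json:"file"`
	Genes    map[string]Gene `json:"genes"`
}

// parseResult decodes pep2gene json output. Hypothetical helper.
func parseResult(data []byte) (Result, error) {
	var r Result
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	data := []byte(`{"database":"database.fasta","genes":{"60":{"name":"ACTB",
		"peptides":{"AGFAGDDAPR":{"allottedSpectralCount":2.5,"totalSpectralCount":5,
		"unique":false,"uniqueShared":true}},"sharedIDs":["71"],"sharedNames":["ACTG1"],
		"spectralCount":5,"subsumed":[],"unique":0,"uniqueShared":1}}}`)
	r, err := parseResult(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	g := r.Genes["60"]
	fmt.Println(g.Name, g.SpectralCount, g.SharedNames) // prints "ACTB 5 [ACTG1]"
}
```

In practice the bytes would be read from the output file rather than an inline string.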

txt

The txt format contains less information than the json format and is not recommended.

The first two lines are headers, followed by gene entries separated by newlines. The first header line contains the keys for the summary line of each hit. In the example below the HitNumber for the first hit is Hit_1, the Gene is ABCD3, the GeneID is 5825, the SpectralCount is 4.00, the number of Unique peptides is 4 and there are no Subsumed genes for the hit. Since spectral counts for peptides can be divided between genes, the spectral count is reported as a floating-point number.

The second gene entry is for a shared group, i.e. the members of this group perfectly share a set of peptides: in this example ACTB and ACTG1, corresponding to the gene IDs 60 and 71 respectively. This group subsumes several other genes, indicated by their IDs.

The summary line for each hit is followed by its assigned peptides. Each peptide has a TotalSpectralCount referring to the total number of spectral counts detected for it in the sample and a yes/no indicator to declare its uniqueness to the gene hit.

HitNumber;;Gene;;GeneID;;SpectralCount;;Unique;;Subsumed
Peptide;;TotalSpectralCount;;IsUnique

Hit_1;;ABCD3;;5825;;4.00;;4;;
DQVIYPDGR;;1;;yes
FDHVPLATPN[115]GDVLIR;;1;;yes
IANPDQLLTQDVEK;;1;;yes
ITELMQVLK;;1;;yes

Hit_2;;ACTB, ACTG1;;60, 71;;8.56;;0;;100996820, 345651, 445582, 653269, 653781, 728378
AGFAGDDAPR;;2;;no
DLTDYLMK;;2;;no
DLYANTVLSGGTTMYPGIADR;;3;;no
DLYANTVLSGGTTM[147]YPGIADR;;1;;no
DSYVGDEAQSK;;2;;no
EITALAPSTMK;;1;;no
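
If the legacy txt format has to be read, a parser along the following lines may be enough. It assumes the layout shown in the example: two header lines, fields separated by ";;", and summary lines that start with "Hit_". The hit type and parseTxt function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hit groups a summary line with the peptide lines that follow it.
// Hypothetical type, for illustration only.
type hit struct {
	Summary  []string   // HitNumber, Gene, GeneID, SpectralCount, Unique, Subsumed
	Peptides [][]string // Peptide, TotalSpectralCount, IsUnique
}

// parseTxt skips the two header lines, then splits each non-empty line on
// ";;". Lines starting with "Hit_" open a new hit (an assumption based on
// the example); other lines are peptides of the current hit.
func parseTxt(text string) []hit {
	var hits []hit
	for i, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if i < 2 || line == "" {
			continue
		}
		fields := strings.Split(line, ";;")
		if strings.HasPrefix(line, "Hit_") {
			hits = append(hits, hit{Summary: fields})
		} else if len(hits) > 0 {
			last := &hits[len(hits)-1]
			last.Peptides = append(last.Peptides, fields)
		}
	}
	return hits
}

func main() {
	text := "HitNumber;;Gene;;GeneID;;SpectralCount;;Unique;;Subsumed\n" +
		"Peptide;;TotalSpectralCount;;IsUnique\n\n" +
		"Hit_1;;ABCD3;;5825;;4.00;;4;;\n" +
		"DQVIYPDGR;;1;;yes\n" +
		"FDHVPLATPN[115]GDVLIR;;1;;yes\n" +
		"IANPDQLLTQDVEK;;1;;yes\n" +
		"ITELMQVLK;;1;;yes\n"
	for _, h := range parseTxt(text) {
		fmt.Println(h.Summary[1], h.Summary[2], len(h.Peptides)) // prints "ABCD3 5825 4"
	}
}
```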
