Increased identifications based on settings #1360

cyhofe · 2025-01-24T08:09:12Z

Dear DIANN community,

At the moment I am searching multiple bacterial metaproteomes of an environmental time series. For each sample, I measured two biological replicates from the same day/system. Each sample was individually extracted, prepared and measured in a sequential manner.

When changing parameters in DIANN 1.9.2 (Linux command line), I saw that some have quite an effect on protein and peptide recovery. I would therefore like to ask whether someone could help me verify the settings for my purpose, as I have a hard time understanding whether my settings indeed fit the underlying samples and experiment. The main goal of this study is to screen the metaproteome in an explorative manner, trying to identify important and abundant proteins in the system.

Before running the search, I generated a database like this:
nohup singularity exec /home/opt/progs/DIANN_1.9.2/diann-1.9.2.img /diann-1.9.2/diann-linux -v --threads 200 --verbose 1 --qvalue 0.01 --out-lib database.parquet --gen-spec-lib --predictor --fasta database.fasta --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --cut K*,R* --missed-cleavages 1 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,*n --peptidoforms --relaxed-prot-inf --rt-profiling --pg-level 0 --no-norm &

After that I used the generated database to search my samples:
nohup singularity exec /home/opt/progs/DIANN_1.9.2/diann-1.9.2.img /diann-1.9.2/diann-linux -v --dir "./samples/" --lib "database.predicted.speclib" --threads 200 --verbose 1 --out "diann-report.tsv" --qvalue 0.01 --matrices --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,*n --peptidoforms --reanalyse --relaxed-prot-inf --rt-profiling --pg-level 0 --no-norm &

During trials I noticed quite some differences when adding:
--reanalyse (enables MBR)
--relaxed-prot-inf (use a very heuristical protein inference algorithm)
--rt-profiling (IDs, RT and IM profiling)
--pg-level 0 (protein inference mode, with 0 - isoforms)
--no-norm (Disable normalization)

The main question I have here would be if I have to set --no-maxlfq whenever I use --no-norm. Also if it is indeed needed to put no-norm for my type of analysis. Also I was wondering if it is indeed valid to put relaxed-prot-inf in that case ? Moreover I'm not sure if I should use smart-profiling instead of rt-profiling ?

Here the description when generating the database:
DIA-NN 1.9.2 (Data-Independent Acquisition by Neural Networks)
Compiled on Oct 31 2024 04:27:44
Current date and time: Fri Dec 20 14:23:26 2024
Logical CPU cores: 256
Thread number set to 150
Output will be filtered at 0.01 FDR
A spectral library will be generated
Deep learning will be used to generate a new in silico spectral library from peptides provided
DIA-NN will carry out FASTA digest for in silico lib generation
Min fragment m/z set to 200
Max fragment m/z set to 1800
N-terminal methionine excision enabled
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 1
Max precursor charge set to 4
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 1
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 1
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n will be considered as variable
Peptidoform scoring enabled
Heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers, GO/pathway and system-scale analyses
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
Implicit protein grouping: isoform IDs; this determines which peptides are considered 'proteotypic' and thus affects protein FDR calculation
Normalisation disabled
The following variable modifications will be scored: UniMod:35 UniMod:1

and the description when analysing the samples with it:
stefan@betula:~/cyrill/SpringBloom2021/DIANN_January2025/Searches/72Samples_p34548_SpringBloom2021_95MagSpeciesDB_15JUL24_20240726_Database_Run07JAN25/DIANN_Search$ cat nohup.out | head -n 30
DIA-NN 1.9.2 (Data-Independent Acquisition by Neural Networks)
Compiled on Oct 31 2024 04:27:44
Current date and time: Tue Jan 7 09:51:19 2025
Logical CPU cores: 256
Thread number set to 200
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 1
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n will be considered as variable
Peptidoform scoring enabled
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
Heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers, GO/pathway and system-scale analyses
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
Implicit protein grouping: isoform IDs; this determines which peptides are considered 'proteotypic' and thus affects protein FDR calculation
Normalisation disabled
DIA-NN will optimize the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.
WARNING: protein inference is enabled but no FASTA provided - is this intended?
The following variable modifications will be scored: UniMod:35 UniMod:1
Unless the spectral library specified was created by this version of DIA-NN, it's strongly recommended to specify a FASTA database and use the 'Reannotate' function to allow DIA-NN to identify peptides which can originate from the N/C terminus of the protein: otherwise site localization might not work properly for modifications of the protein N-terminus or for modifications which do not allow enzymatic cleavage after the modified residue

Thank you in advance for any comments on this post. I have to admit I'm rather inexperienced in the field of metaproteomics, which is why I invested some time in trying different settings to see what I get.

THANK YOU

Answered by vdemichev

vdemichev · 2025-01-27T09:44:58Z

vdemichev
Jan 27, 2025
Maintainer

Hi,

> my settings indeed fit the underlying samples and experiment

In general, a good idea is to keep the default settings, changed only as recommended in the DIA-NN docs. In 99% of cases this is the best approach. This means no M(Ox) or N-term(Ac) modifications.

> Before running the search, I generated a database like this:

Fine, except you don't want to include variable mods on the first try, as indicated above. If your data is dia-PASEF, it makes sense to restrict the precursor charge range. In any case, it makes sense to restrict the precursor m/z range to that of the experiment.
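As a rough illustration of that advice, the library-generation command from the question could be reassembled without the three variable-modification flags (--var-mods and both --var-mod entries) and with a narrower precursor range. This is only a dry-run sketch: build_lib_cmd is a hypothetical helper, the m/z and charge values passed to it are placeholders (the thread never states the acquisition window), and every other flag is copied unchanged from the original command.

```shell
#!/bin/sh
# Dry run: print a library-generation command without variable modifications.
# The four arguments stand for the acquisition's real precursor m/z and
# charge ranges; the thread does not state them, so they must be supplied.
build_lib_cmd() {
  printf '%s ' \
    singularity exec /home/opt/progs/DIANN_1.9.2/diann-1.9.2.img \
    /diann-1.9.2/diann-linux -v --threads 200 --verbose 1 --qvalue 0.01 \
    --out-lib database.parquet --gen-spec-lib --predictor \
    --fasta database.fasta --fasta-search \
    --min-fr-mz 200 --max-fr-mz 1800 --met-excision \
    --min-pep-len 7 --max-pep-len 30 \
    --min-pr-mz "$1" --max-pr-mz "$2" \
    --min-pr-charge "$3" --max-pr-charge "$4" \
    --cut 'K*,R*' --missed-cleavages 1 --unimod4 \
    --peptidoforms --relaxed-prot-inf --rt-profiling --pg-level 0 --no-norm
  printf '\n'
}

# Hypothetical ranges, for illustration only.
build_lib_cmd 400 1000 2 4
```

The function only prints the command so it can be reviewed first; the printed line can then be pasted into the shell and run under nohup as in the original post.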

> The main question I have here would be if I have to set --no-maxlfq whenever I use --no-norm.

No.

> Also if it is indeed needed to put no-norm for my type of analysis.

I don't think so. You can always apply your custom normalisation in R on top of DIA-NN's normalised quantities.
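To make "custom normalisation on top" concrete, here is a small sketch of one simple scheme, rescaling every sample column to the same total. The answer suggests doing this in R; awk is used here only to stay in the shell that the rest of the thread uses. The matrix layout (one identifier column followed by sample columns), the function name normalise_totals and all numbers are assumptions for illustration, and a median-based scheme may suit real data better.

```shell
#!/bin/sh
# Rescale every sample column to the mean column total.
# Assumed layout: column 1 is an identifier, the remaining columns are samples.
normalise_totals() {
  awk -F '\t' -v OFS='\t' '
    NR == FNR {                       # pass 1: per-sample totals
      if (FNR > 1) {
        for (i = 2; i <= NF; i++) if ($i != "") sum[i] += $i
      }
      next
    }
    FNR == 1 {                        # pass 2, header row: compute target total
      for (i = 2; i <= NF; i++) target += sum[i] / (NF - 1)
      print
      next
    }
    {                                 # pass 2, data rows: rescale non-missing values
      for (i = 2; i <= NF; i++)
        if ($i != "" && sum[i] > 0) $i = sprintf("%.1f", $i * target / sum[i])
      print
    }
  ' "$1" "$1"
}

# Toy matrix with made-up numbers: S2 has three times the total signal of S1.
toy_matrix=$(mktemp)
printf 'Protein.Group\tS1\tS2\n'  > "$toy_matrix"
printf 'P1\t100\t300\n'          >> "$toy_matrix"
printf 'P2\t300\t900\n'          >> "$toy_matrix"

normalise_totals "$toy_matrix"
```

On the toy matrix both samples end up with a total of 800, so P1 becomes 200.0 and P2 becomes 600.0 in each column. The file is passed to awk twice so that the totals are known before any value is rescaled.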

> Also I was wondering if it is indeed valid to put relaxed-prot-inf in that case ?

Protein inference is in general quite 'unreliable'. So yes, I would keep it 'heuristical', but pay attention to the Protein.Ids column in your report, and maybe rely on proteotypic peptides only and, correspondingly, on Genes.MaxLFQ.Unique. For this it is also a good idea to use --ids-to-names (replaces gene names with sequence IDs, better to avoid any confusion between orthologues) for both predicted library generation and analysis.
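Restricting the report to proteotypic peptides could look roughly like the sketch below. It assumes the main report is a tab-separated file with a header row and a 0/1 column named Proteotypic; filter_proteotypic is a hypothetical helper, and the toy rows (including the Run and Precursor.Id columns and all sequence IDs) are invented for illustration.

```shell
#!/bin/sh
# Keep the header and every row flagged as proteotypic (Proteotypic == 1).
# The column is located by name, so its position in the report does not matter.
filter_proteotypic() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "Proteotypic") col = i
      if (!col) exit 1                # no such column: stop with an error status
      print
      next
    }
    $col == 1
  ' "$1"
}

# Toy report with made-up rows; the second precursor maps to two sequence IDs.
toy_report=$(mktemp)
printf 'Run\tProtein.Ids\tPrecursor.Id\tProteotypic\tGenes.MaxLFQ.Unique\n' > "$toy_report"
printf 'S1\tMAG1_0001\tPEPTIDEK2\t1\t1200.5\n'            >> "$toy_report"
printf 'S1\tMAG1_0002;MAG7_0311\tSHAREDPEPR2\t0\t0\n'     >> "$toy_report"
printf 'S2\tMAG1_0001\tPEPTIDEK2\t1\t980.0\n'             >> "$toy_report"

filter_proteotypic "$toy_report"
```

Only the header and the two MAG1_0001 rows are printed; the precursor shared between MAG1_0002 and MAG7_0311 is dropped. On a real report the function would be called as filter_proteotypic diann-report.tsv.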

> Moreover I'm not sure if I should use smart-profiling instead of rt-profiling ?

No.

3 replies

cyhofe Jan 27, 2025
Author

THANK YOU very much for your quick answer and your fantastic tool. Your comments are very helpful.

Best,
Cyrill

wasdoff Mar 7, 2025

Hello Vadim,

Zooming in on this remark you made:

> This means no M(Ox) or N-term(Ac) modifications.

Why is that? It's the setting I've always used in other software (and in my uses of DIA-NN so far), and I often identify plenty of peptides with these modifications. Are your concerns with speed, quantitation, handling multiple modifications at once, or something else maybe?

Kind regards,
Wouter

vdemichev Mar 9, 2025
Maintainer

> I've always used in other software

If the other software does not ensure peptidoform confidence, it's unknown whether the reported peptide IDs are actually correct.

> and I often identify plenty of peptides with these modifications.

But in 99% of cases the number of proteins will stay the same, while the search takes longer.
