Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differential splicing analysis. #23

Merged
merged 57 commits into from
Oct 5, 2021
Merged

Differential splicing analysis. #23

merged 57 commits into from
Oct 5, 2021

Conversation

johanneskoester
Copy link
Contributor

This PR introduces differential splicing analysis with IsoformSwitchAnalyzeR.

# https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html
coding_cutoff: 0.725
# Should be set to true when using de-novo assembled transcripts.
remove_noncoding_orfs: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add the alpha/sig-level here. I would like to harmonize the notation for significance in the pipeline. Maybe even just one global threshold? Allowing to set it differently everywhere seems both error prone and even problematic from a philosophic perspective.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jep, I see your point and have been thinking about this, as well. But I don't see a perfect solution, because different analyses might be run with different goals in mind. As a little made-up example, a user might want to:

  • control the false discovery rate of the differential expression analysis very permissively (say at 0.4, to include most true positives), to have as many candidates in there as possible for downstream analyses
  • control the false discovery rate of the pathway enrichment analysis downstream very strictly (say at 0.005), to yield a very precise set of pathways, because they have this fancy setup to test pathways where it is very expensive to test each pathway
  • control the false discovery rate of the differential splicing analysis somewhere in between (say at 0.1), because they have a cheap way of following those results that scales fairly well and want to test a good number of candidates

This already gives us three different thresholds. But we definitely have to sit down and get an overview of how many different thresholds are in any way useful and then maybe create one threshold category in the config.yaml with those thresholds and expressive comments about which steps they control. This way we could drop all the plot- and method-specific thresholds in favor of more general thresholds. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good!

cds="results/refs/transcriptome.cds.fasta",
ncrna="results/refs/transcriptome.ncrna.fasta"
output:
"results/refs/cpat.logit.RData"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paths have to be adjusted though. This should go into "resources/...".

Copy link
Collaborator

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some minor points to address and a bunch of tidyverse suggestions. :)

# https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html
coding_cutoff: 0.725
# Should be set to true when using de-novo assembled transcripts.
remove_noncoding_orfs: false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jep, I see your point and have been thinking about this, as well. But I don't see a perfect solution, because different analyses might be run with different goals in mind. As a little made-up example, a user might want to:

  • control the false discovery rate of the differential expression analysis very permissively (say at 0.4, to include most true positives), to have as many candidates in there as possible for downstream analyses
  • control the false discovery rate of the pathway enrichment analysis downstream very strictly (say at 0.005), to yield a very precise set of pathways, because they have this fancy setup to test pathways where it is very expensive to test each pathway
  • control the false discovery rate of the differential splicing analysis somewhere in between (say at 0.1), because they have a cheap way of following those results that scales fairly well and want to test a good number of candidates

This already gives us three different thresholds. But we definitely have to sit down and get an overview of how many different thresholds are in any way useful and then maybe create one threshold category in the config.yaml with those thresholds and expressive comments about which steps they control. This way we could drop all the plot- and method-specific thresholds in favor of more general thresholds. What do you think?

workflow/rules/common.smk Outdated Show resolved Hide resolved
workflow/scripts/sleuth-init.R Show resolved Hide resolved
# TODO migrate this to tidyverse
# Ensure that primary variable factors are sorted such that base_level comes first.
# This is important for fold changes, effect sizes to have the expected sign.
samples[, primary_variable] <- relevel(samples[, primary_variable, drop=TRUE], base_level)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fct_relevel() should be your friend: https://forcats.tidyverse.org/reference/fct_relevel.html

The following should move the base_level to the front of the levels vector:

samples $>$ mutate(!!primary_variable := fct_relevel(!!primary_variable, !!base_level)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

workflow/scripts/sleuth-init.R Outdated Show resolved Hide resolved
workflow/scripts/isoform-switch-analysis-annotate.R Outdated Show resolved Hide resolved
workflow/scripts/isoform-switch-analysis-annotate.R Outdated Show resolved Hide resolved
workflow/scripts/isoform-switch-analysis-init.R Outdated Show resolved Hide resolved
workflow/envs/isoform-switch-analyzer.yaml Outdated Show resolved Hide resolved
"../envs/isoform-switch-analyzer.yaml"
params:
model=get_model,
samples=["{unit.sample}-{unit.unit}".format(unit=unit) for unit in units.itertuples()],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would conflict with the merging of samples into one file per sample, depending on how we resolve PR #16 .

@johanneskoester johanneskoester merged commit 55825a9 into main Oct 5, 2021
@johanneskoester johanneskoester deleted the diffsplice branch October 5, 2021 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants