-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Differential splicing analysis. #23
Conversation
# https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html | ||
coding_cutoff: 0.725 | ||
# Should be set to true when using de-novo assembled transcripts. | ||
remove_noncoding_orfs: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add the alpha/sig-level here. I would like to harmonize the notation for significance in the pipeline. Maybe even just one global threshold? Allowing to set it differently everywhere seems both error prone and even problematic from a philosophic perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jep, I see your point and have been thinking about this, as well. But I don't see a perfect solution, because different analyses might be run with different goals in mind. As a little made-up example, a user might want to:
- control the false discovery rate of the differential expression analysis very permissively (say at 0.4, to include most true positives), to have as many candidates in there as possible for downstream analyses
- control the false discovery rate of the pathway enrichment analysis downstream very strictly (say at 0.005), to yield a very precise set of pathways, because they have this fancy setup to test pathways where it is very expensive to test each pathway
- control the false discovery rate of the differential splicing analysis somewhere in between (say at 0.1), because they have a cheap way of following those results that scales fairly well and want to test a good number of candidates
This already gives us three different thresholds. But we definitely have to sit down and get an overview of how many different thresholds are in any way useful and then maybe create one threshold category in the config.yaml
with those thresholds and expressive comments about which steps they control. This way we could drop all the plot- and method-specific thresholds in favor of more general thresholds. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good!
workflow/rules/refs.smk
Outdated
cds="results/refs/transcriptome.cds.fasta", | ||
ncrna="results/refs/transcriptome.ncrna.fasta" | ||
output: | ||
"results/refs/cpat.logit.RData" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The paths have to be adjusted though. This should go into "resources/...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, some minor points to address and a bunch of tidyverse suggestions. :)
# https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html | ||
coding_cutoff: 0.725 | ||
# Should be set to true when using de-novo assembled transcripts. | ||
remove_noncoding_orfs: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jep, I see your point and have been thinking about this, as well. But I don't see a perfect solution, because different analyses might be run with different goals in mind. As a little made-up example, a user might want to:
- control the false discovery rate of the differential expression analysis very permissively (say at 0.4, to include most true positives), to have as many candidates in there as possible for downstream analyses
- control the false discovery rate of the pathway enrichment analysis downstream very strictly (say at 0.005), to yield a very precise set of pathways, because they have this fancy setup to test pathways where it is very expensive to test each pathway
- control the false discovery rate of the differential splicing analysis somewhere in between (say at 0.1), because they have a cheap way of following those results that scales fairly well and want to test a good number of candidates
This already gives us three different thresholds. But we definitely have to sit down and get an overview of how many different thresholds are in any way useful and then maybe create one threshold category in the config.yaml
with those thresholds and expressive comments about which steps they control. This way we could drop all the plot- and method-specific thresholds in favor of more general thresholds. What do you think?
# TODO migrate this to tidyverse | ||
# Ensure that primary variable factors are sorted such that base_level comes first. | ||
# This is important for fold changes, effect sizes to have the expected sign. | ||
samples[, primary_variable] <- relevel(samples[, primary_variable, drop=TRUE], base_level) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fct_relevel()
should be your friend: https://forcats.tidyverse.org/reference/fct_relevel.html
The following should move the base_level
to the front of the levels vector:
samples $>$ mutate(!!primary_variable := fct_relevel(!!primary_variable, !!base_level)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
workflow/rules/diffsplice.smk
Outdated
"../envs/isoform-switch-analyzer.yaml" | ||
params: | ||
model=get_model, | ||
samples=["{unit.sample}-{unit.unit}".format(unit=unit) for unit in units.itertuples()], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would conflict with the merging of samples into one file per sample, depending on how we resolve PR #16 .
This PR introduces differential splicing analysis with IsoformSwitchAnalyzeR.