RNA-Seq Workflow v2.0 #1
- Added `multiqc` step to workflow.
- Went back and corrected the TODO list, as the `qualimap` pinning has already happened.
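For reference, a minimal MultiQC invocation over a results directory might look like the sketch below; the directory layout and output location are assumptions, not part of this PR.

```bash
# Aggregate QC output (e.g. FastQC, STAR logs, qualimap) from a results
# directory into a single report; paths here are placeholders.
multiqc results/ --outdir results/multiqc
```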
- No idea what `fq lint` is.
- "Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model":
- Some of the GATK and Picard tools have stringency settings. Check whether one exists for `ValidateSamFile`. The risk is that GATK-based tools will crap out downstream if the BAM file isn't "just so".
- `--sjdbOverhang 125`: check that this makes sense for our data.
- Split BAM file into multiple BAMs on the different read groups: why are we doing this? We only want to give STAR a single pair of FastQs, don't we?
- Run Picard `MarkDuplicates` on the STAR-aligned BAM file: what is GDC doing? Last I heard, there is no consensus on whether to mark/remove duplicates from RNA-Seq data. So long as we mark and do not remove, I think it should be fine.
- Why not add splice junction (RNA-Peg) and HTSeq/Salmon quantification here, since you are already messing with the data anyway? Same for the coverage bw file. May as well get it all out of the way in one go. Or else, break these out into a downstream module for secondary analysis.
- Using the full path in the header might be useful for forensic reasons in the future.
Michael Macias and I developed this tool to catch common errors we see in FastQ files. We use this to ensure the BAM to FastQ step worked properly.
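For anyone else unfamiliar with it, the invocation is roughly as sketched below; the file names are placeholders and the exact interface may differ between versions.

```bash
# Validate a FastQ pair produced by the BAM -> FastQ step; the tool exits
# non-zero if it finds common formatting errors. File names are placeholders.
fq lint sample_R1.fastq.gz sample_R2.fastq.gz
```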
I'm of the same mind.
Noted, I think we are covered with our current use of
Most places use 100 for this value. The consensus from our RNA-Seq workflow v1 meeting was that it was better to go over than under on this parameter, and some of our data may be more than 100bp.
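For concreteness, a sketch of where this value gets set at index-build time; the reference and annotation paths are placeholders.

```bash
# Build the STAR index with the overhang discussed above; 125 follows the
# (max read length - 1) rule of thumb for 126 bp reads, and a larger value
# is generally harmless for shorter reads.
STAR \
    --runMode genomeGenerate \
    --genomeDir star_index/ \
    --genomeFastaFiles GRCh38.primary_assembly.fa \
    --sjdbGTFfile gencode.annotation.gtf \
    --sjdbOverhang 125
```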
Primarily we do this because it's the easiest way to retain the read group information and feed it back into STAR for the remapped BAM. So you split the original BAM by read group, you store the read group information alongside the pair of FastQ files, then you pass it in comma-delimited order back to STAR.
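A rough sketch of that flow, with the read-group IDs, sample name, and the BAM-to-FastQ step all as placeholders:

```bash
# 1. Split the original BAM into one BAM per read group (named by RG ID).
samtools split -f '%!.%.' sample.bam

# 2. Convert each per-read-group BAM to a FastQ pair (e.g. rg1_R1.fq.gz /
#    rg1_R2.fq.gz), then hand all pairs back to STAR in the same order,
#    with a matching read-group line for each.
STAR \
    --genomeDir star_index/ \
    --readFilesCommand zcat \
    --readFilesIn rg1_R1.fq.gz,rg2_R1.fq.gz rg1_R2.fq.gz,rg2_R2.fq.gz \
    --outSAMattrRGline ID:rg1 SM:sample , ID:rg2 SM:sample
```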
Based on their mRNA documentation, they don't appear to be doing any duplicate marking/removing (I find that hard to believe). It may just not be documented. For what it's worth, they do say that they use
Agree with this, I think it's probably safer to go with HTSeq than Salmon at this point. I can give more details on my feelings why that is the case if people are interested. I hadn't considered computing the bigWig file, but it seems reasonable.
Begrudgingly agree.

> # Outstanding questions
>
> * Any parameters we want to change during the STAR alignment step? I don't expect any, but we should explicitly discuss.
Considering outlier failures, the two parameters that I sometimes have to tweak are `limitSjdbInsertNsj` and `limitOutSJcollapsed`. They are few enough to treat as special cases, but I'm just noting it here for discussion.
Yeah I agree with this. I'll put it in the "to investigate" list.
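For the record, these are passed directly on the STAR command line; the raised values in the sketch below are illustrative placeholders, not recommendations.

```bash
# Re-run an outlier sample with the two splice-junction limits raised above
# their defaults; paths and values here are placeholders.
STAR \
    --genomeDir star_index/ \
    --readFilesCommand zcat \
    --readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
    --limitSjdbInsertNsj 2000000 \
    --limitOutSJcollapsed 5000000
```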
- "The inclusion of the ERCC genome in all of our RNA-Seq samples was violating that practice and causing problems for cross-target analyses": Removing the ERCC spike-in controls makes sense, since they are causing incompatibility with other pipelines. They are a special case and a possible candidate for a future tool.
- "Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model"
- `--sjdbOverhang 125` is not the "ideal" according to the manual given our current mapping, but more is harmless, and some users might eventually have longer reads that could take advantage of this extension.
- `htseq-count -m union` vs. `-m intersection-nonempty` or `intersection-strict`: the union option is the standard/default; most people will use it. The "intersection non-empty" option will add reads when the read fully maps to one transcript over partial mapping in others. I think it is reasonable and may happen more often if level 3 transcripts are included; those reads would otherwise be excluded. GDC uses it, so including it will move us one step closer to harmonized/federated data, although I don't like all their choices.
- As for gene id or gene name, I think gene name is more useful for end users in a count file. We need to decide how valuable it is to be harmonized with GDC. I would choose end users over GDC at this stage. "GDC-like" results that use all GDC options could be an optional format at some later point.
- `--secondary-alignments ignore`
|
Thanks a lot for preparing the proposal. I enjoyed reading this.
|
Yes, the author of HTSeq noted in the release notes for 0.11.0 that these should typically be ignored.
At this point, I think generating MD5s only would be sufficient, but we may also want to generate SHA256 internally, because I think that is a better long-term solution.
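A trivial sketch of producing both, with the file name as a placeholder:

```bash
# Generate an MD5 for external delivery and a SHA-256 for internal bookkeeping.
md5sum sample.bam > sample.bam.md5
sha256sum sample.bam > sample.bam.sha256
```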
Some further thoughts regarding the coverage bw file...

Scott:
Clay:
Alex: Can we incorporate into the pipeline?
@adthrasher, thoughts? ^

I agree with Alex that this is a good idea to include. I'll take a look at what is involved.
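One way this could be generated, as a sketch only: deepTools bamCoverage is an assumed tool choice here, not something decided in this thread, and the normalization, bin size, and file names are likewise assumptions.

```bash
# Produce a coverage bigWig from a sorted, indexed BAM.
bamCoverage \
    --bam sample.sorted.bam \
    --outFileName sample.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --normalizeUsing BPM
```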
The two-week public comment period has elapsed, so this is considered final. Any further changes can be made through a new PR.
Assigned RFC 0001.
Just saw this and wanted to say the GDC absolutely does not mark duplicates for RNA-Seq. That is generally not a good practice for RNA-Seq unless you have UMIs, where you can confidently remove duplicates. We also run all the read groups at once through STAR (you have to do SE and PE separately, though).
Hi @kmhernan, thanks for your comment on this. Note that we are only marking the duplicates (not removing them). I'm curious whether you are aware of any negative downstream effects from just marking duplicates for RNA-Seq data? Most tools that should account for true biological duplications (e.g. expression quantification using

If true, that would make the decision whether or not to mark the duplicates one of opinion. After giving it a few moments' thought, it may actually be better to assert nothing at all about the duplicate status than to mark them. That would also reduce the runtime, which is a plus.

For those interested, this is worth skimming: https://www.nature.com/articles/srep25533.

I'll discuss with our team. Thanks again for your contribution! cc: @adthrasher
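For context, the marking-only behavior being discussed (duplicates flagged but retained) corresponds to Picard's default, roughly as in the sketch below; file names are placeholders.

```bash
# Flag duplicates without removing them. REMOVE_DUPLICATES defaults to false;
# it is shown explicitly here only for clarity.
picard MarkDuplicates \
    I=sample.star.bam \
    O=sample.star.markdup.bam \
    M=sample.markdup_metrics.txt \
    REMOVE_DUPLICATES=false
```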
After internal consideration, we have decided to no longer mark duplicates for RNA-Seq. The pipeline will be updated in the near future.
Hey, sorry, I was traveling this weekend. Yeah, just marking seems like it wouldn't be an issue (only compute cost). HTSeq does some bizarro things that you can only really see in the source (at least in my experience). I actually prefer just getting the counts from STAR because kallisto, Salmon, etc. seem to perform better in quantification anyway. If you wanted to call RNA-Seq variants (there's no good way to do this yet), then your markdups step would be needed.
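For completeness, getting counts directly from STAR as mentioned is a single extra flag; STAR writes them to `ReadsPerGene.out.tab` with unstranded and stranded columns. Paths in this sketch are placeholders.

```bash
# Ask STAR to emit per-gene counts alongside the alignment.
STAR \
    --genomeDir star_index/ \
    --readFilesCommand zcat \
    --readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
    --quantMode GeneCounts
```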
Hi folks - thanks for inviting feedback. As you know, we at Treehouse have many similar issues to those you address here (1). We agree that updating reference files and software is worthwhile. We do not perform gene model post-processing, for the reasons you've outlined. We include potential duplicate reads in our quantification steps, but subsequently mark duplicates and count non-duplicate reads as a QC step. We have found that counting mapped, exonic, non-duplicate reads has allowed us to identify and exclude some extremely low-quality samples that would not otherwise be detected (2,3).

1. Vaske et al., 2019, http://dx.doi.org/10.1001/jamanetworkopen.2019.13968
Thanks for reviewing @hbeale, this is really helpful and affirming feedback. We'll be sure to let you know when we release the new version of the data.