MultiQC for list of pairs (of FastQC output) #1658

Closed
bernt-matthias opened this issue Jan 4, 2018 · 16 comments

Comments

@bernt-matthias
Contributor

If I run FastQC on a list of paired data, I also get the FastQC results as a list of pairs. These are not selectable as input in a MultiQC analysis.

I'm unsure if this should be fixed in FastQC (maybe it should just output a list of (unpaired) reports) or in MultiQC, which maybe should also support lists of paired data as input?

@mblue9
Contributor

mblue9 commented Jan 4, 2018

Yes, that would be great if MultiQC could work with paired collections and FastQC! I ran into this issue as well.

@bernt-matthias
Contributor Author

Do you know why it does not accept lists of pairs as input? Is there something wrong in the tool XML form description?

@mvdbeek
Member

mvdbeek commented Jan 5, 2018

fastqc (the software, not the wrapper) is not aware of paired end datasets.
What you can do is:

  • create interleaved fastqs
  • create 2 separate lists with the Unzip Collection tool and:
    • merge the forward and reverse
    • or choose to only analyse the forward or reverse
  • look out for a replacement for fastqc that can deal with paired end data
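
As a rough illustration of the Unzip Collection route above, the sketch below models a list of pairs as plain Python dictionaries. It is not Galaxy code: the sample names, file names, and the `unzip` and `merge` helpers are made up for illustration only.

```python
# Hypothetical stand-in for a list:paired collection (names are invented).
list_paired = {
    "sampleA": {"forward": "sampleA_R1.fastq", "reverse": "sampleA_R2.fastq"},
    "sampleB": {"forward": "sampleB_R1.fastq", "reverse": "sampleB_R2.fastq"},
}


def unzip(collection):
    """Split a list of pairs into a forward list and a reverse list."""
    forward = {name: pair["forward"] for name, pair in collection.items()}
    reverse = {name: pair["reverse"] for name, pair in collection.items()}
    return forward, reverse


def merge(forward, reverse):
    """Combine both lists into one flat list, tagging each element by direction."""
    merged = {f"{name}_forward": path for name, path in forward.items()}
    merged.update({f"{name}_reverse": path for name, path in reverse.items()})
    return merged


forward, reverse = unzip(list_paired)

# Option 1: merge forward and reverse into a single flat list.
merged = merge(forward, reverse)

# Option 2: analyse only one direction.
forward_only = forward
```

Either result is a simple list, which is the shape that can be selected as input where a list of pairs cannot.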

@bernt-matthias
Contributor Author

Thanks for the information.

What I do not understand is why the tool form does not list the available lists of pairs as input. Would this need an additional attribute on the <param> tag? It might be reasonable because I guess that the <command> part would also need changes for iterating over the list of pairs.

For the application I think it does not matter that FastQC does not know about paired data sets. One could treat them separately with FastQC. This is also what actually happens: a list of pairs of fastq data is transformed by FastQC into a list of pairs of reports. So I would suggest to modify the MultiQC tool such that also list of pairs are accepted and the result should be the same as if a joint list of single fastq files is used as input. The toolbox function of MultiQC allows coloring by sample name or forward/reverse information (or any other info contained in the filename). If users prefer to analyze interleaved data they can still do so, but I just learned that forward and reverse reads may differ in their qualities, so it makes sense to treat them separately.

@mvdbeek
Member

mvdbeek commented Jan 5, 2018

> So I would suggest to modify the MultiQC tool such that also list of pairs are accepted and the result should be the same as if a joint list of single fastq files is used as input

How would that work? Just iterating over the forward or reverse reads? Why not do this explicitly using the Unzip Collection tool? I'm happy to see an implementation, but I don't think MultiQC does well with R1/R2 or forward/reverse pairs, based on MultiQC/MultiQC#542.

Now if the report was actually for paired end data this wouldn't be an issue at all.

@mvdbeek
Member

mvdbeek commented Jan 5, 2018

I think http://multiqc.info/docs/#afterqc may be a good option to replace fastqc for this purpose.

@mblue9
Contributor

mblue9 commented Jan 6, 2018

Yes, I'd love to see a Galaxy tool for AfterQC anyway, and if it helps get around this issue with FastQC then that would be great!

@bernt-matthias
Contributor Author

Just a note: according to the readme of https://github.com/OpenGene/AfterQC afterqc has been reimplemented: https://github.com/OpenGene/fastp

@bebatut
Member

bebatut commented Apr 9, 2018

Do you think we should handle the paired collection for the FastQC module of MultiQC? And then only for FastQC?

@bernt-matthias
Contributor Author

Thanks for reviving this issue, @bebatut.

First, I would like to understand why MultiQC does not accept a list of pairs of FastQC as input. I do not see anything in the tool xml that limits this. Is this because multiple="true" is used?

Second, I think that the workaround of @mvdbeek (create two separate lists and merge them / maybe flatten the collection / interleave them) should be sufficient to do what I want. In the end I only want to analyze forward and reverse reads together.

In conclusion, I would suggest closing the issue once we understand the technical reason why paired lists are not accepted.

@bebatut
Member

bebatut commented Apr 11, 2018

I tested: if you have multiple=true, you can select a pair of datasets but not a list of pairs. I opened an issue on Galaxy (galaxyproject/galaxy#5875) to ask whether this is the expected behaviour.

@jmchilton
Member

jmchilton commented Apr 11, 2018

It is definitely the expected behavior. I think in this use case the semantics of what should occur are kind of clear because you know a lot about the tool and data that Galaxy does not know and that isn't represented in the tool wrapper.

The problem is, say I allow a list:paired to be sent to any data parameter with multiple=true enabled. There are two different things that one can imagine doing with that: running the tool over each pair as a separate job and producing a list as a result, or running all of the files together and producing a single output (maybe call this a complete reduction).

The complete reduction option makes sense here for this tool, but I would imagine if the tool was like "cat1" or some tool that summarized fastq files, the other option of mapping over the list and reducing the pairs would make sense in some cases. And the mapping over the list and reducing the pairs behavior is more in line with what happens, for instance, if a list:list is passed to a multiple data parameter. There I think Galaxy does the reduction of the inner lists and maps over the outer lists, which generally makes sense IMO.
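
To make the two readings concrete, here is a minimal sketch in plain Python. It only illustrates the semantics and is not how Galaxy implements them: the `summarize` function is a hypothetical stand-in for a tool with a multiple=true data input, and the collection contents are invented.

```python
# Hypothetical list:paired collection of FastQC reports (names are invented).
list_paired = {
    "sampleA": {"forward": "sampleA_R1.html", "reverse": "sampleA_R2.html"},
    "sampleB": {"forward": "sampleB_R1.html", "reverse": "sampleB_R2.html"},
}


def summarize(datasets):
    """Stand-in for a tool that accepts several datasets in one input."""
    return f"report over {len(datasets)} datasets"


def map_over_list(tool, collection):
    """Map over the outer list and reduce each pair: one job per sample."""
    return {name: tool(list(pair.values())) for name, pair in collection.items()}


def complete_reduction(tool, collection):
    """Ignore the nesting and run one job over every dataset."""
    all_datasets = [d for pair in collection.values() for d in pair.values()]
    return tool(all_datasets)


per_sample = map_over_list(summarize, list_paired)   # a list of outputs
single = complete_reduction(summarize, list_paired)  # a single output
```

Both calls are valid for the same input, which is the ambiguity: nothing in the collection itself says which of the two the researcher wants.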

In the list:list or the list:paired case, if you know you want to wipe out the structure of the nested collection, Galaxy provides the tools to do that: you can use the flatten collection operation tool. It should reorganize the information to make it clear the nested structure is not important for a given application or part of your workflow. This may feel heavy, but it shouldn't duplicate any of the actual data on disk and it should run relatively quickly.
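
Conceptually, flattening could look like the sketch below. This is an assumption-laden model rather than the flatten tool's actual code: the `flatten` helper, the dictionary representation, and the underscore used to join identifiers are all chosen here for illustration.

```python
# Hypothetical nested collection; values are references to datasets, not data.
list_paired = {
    "sampleA": {"forward": "sampleA_R1.html", "reverse": "sampleA_R2.html"},
    "sampleB": {"forward": "sampleB_R1.html", "reverse": "sampleB_R2.html"},
}


def flatten(collection, prefix="", join="_"):
    """Return a flat list of (identifier, dataset) from a nested collection.

    Nested identifiers are joined with `join` (an assumed convention), so the
    sample and direction stay visible in the flat element names.
    """
    flat = []
    for identifier, element in collection.items():
        name = f"{prefix}{join}{identifier}" if prefix else identifier
        if isinstance(element, dict):
            # Recurse into inner lists or pairs; works for list:list as well.
            flat.extend(flatten(element, name, join))
        else:
            flat.append((name, element))
    return flat


flat_list = flatten(list_paired)
```

Only the references are rearranged in this model, which mirrors the point that no data on disk needs to be duplicated.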

Newer tool form options should be introduced that would let the researcher say -
"hey I know this is a nested collection but just do a complete reduction on it"
(xref galaxyproject/galaxy#4707 / xref galaxyproject/galaxy#4623), but I think for now the flatten tool is the way to go.

@bebatut
Member

bebatut commented Apr 13, 2018

So for now, I think we cannot do anything. @bernt-matthias what do you think?

@bernt-matthias
Contributor Author

I agree. Thanks to @jmchilton for the explanations.

@bebatut
Member

bebatut commented Apr 13, 2018

Yes thanks @jmchilton !!!

@jennaj
Member

jennaj commented Jun 5, 2018

Still a problem even with collection "list" input. Another fix is going on. Should this be re-opened too?

galaxyproject/usegalaxy-playbook#114
