Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using a fasta file with multiple adapter sequences for trimming #86

Closed
tobjores opened this issue Apr 1, 2020 · 14 comments
Closed

Using a fasta file with multiple adapter sequences for trimming #86

tobjores opened this issue Apr 1, 2020 · 14 comments

Comments

@tobjores
Copy link

tobjores commented Apr 1, 2020

Since version 1.5, cutadapt accepts a fasta file with multiple adapters by using "file:adapters.fasta" as adapter. This feature, however, cannot be used with trim_galore, as it enforces the adapters to consist only of DNA letters. Yet, this could be easily implemented by adding a test for the prefix "file:" to the adapter check routine. If this is encountered the adapter "file:..." can be directly forwarded to cutadapt (without upper-casing it first). Optionally, trim_galore could check if the specified file exists and issue an error if not.

@FelixKrueger
Copy link
Owner

Hi @tobjores

Generally, it wouldn't be a problem to allow additional characters in the adapter sequence. We have so far not allowed to use an entire file of adapter sequences, because we were of the opinion that this hardly ever something you want to be doing for typical genomic use cases.
For instance, if you wanted to supply 48 different full length TruSeq adapter sequences that only differ in their index sequence, it is perfectly sufficient to supply the sequence that all adapters have in common (in this particular example the auto-detection would even find and do this for you).

Could you give an example of your intended use case?

@tobjores
Copy link
Author

tobjores commented Apr 1, 2020

In my case, I had promoter variants that were linked to three different 5'-UTRs. For aligning the promoter sequences to the reference, I had to trim off the UTRs. I could have done that by running trim_galore three times (once for each adapter) or one time with a file containing the three UTR sequences.
I agree that this is probably a rare use case, but it is easy enough to implement:
Exchanging

unless ($adapter =~ /^[ACTGNXactgnx]+$/){
    die "Adapter sequence must contain DNA characters only (A,C,T,G or N)!\n";
}
}
$adapter = uc$adapter;

with

if ($adapter =~ /^[ACTGNXactgnx]+$/){
    $adapter = uc$adapter;
}
else ($adapter =~ /^file:/){
    # optionally add a check for file existence here (my perl is not good enough for that)
    $adapter = $adapter;
}
else {
    die "Adapter sequence must contain DNA characters only (A,C,T,G or N) or be a file name prefixed with 'file:'!\n";
}

should do the trick

@FelixKrueger
Copy link
Owner

Hi @tobjores

Apologies for the slow response, in addition to all the working from home we also had some tragic incidents at our institute...

I have tried to understand the issue you raised, and you might indeed be right that the -a file:adapters.fasta could work. It would however require me to run a few more tests to make sure that this logic doesn't break anything else within Trim Galore.

In the meantime, I have pushed a new version to Github that should allow you to specify your three different adapter on the command line. The format would have to look like this:

trim_galore -a " AGCTAGCG -a TCTCTTATAT -a TTTATTCGGATTTAT" file.fastq.gz

Where I just picked some random DNA sequences as examples. Could you clone the latest dev version of Trim Galore and let me know if it works in your hands? Once I've got a bit more time I will try to look into the file:adapters.fasta solution a bit more.

Best, Felix

@tobjores
Copy link
Author

tobjores commented Apr 2, 2020

Hi @FelixKrueger
the dev version works great for me, although currently multiple adapters only work for adapter1 but not adapter2. Thanks for taking care of this!
Best,
Tobias

@FelixKrueger
Copy link
Owner

Hi Tobias,

Thanks for the feedback. I have now done some more tests, and can confirm that you can now either use a command line option such as this:

-a  " AGCTCCCG -a TTTCATTATAT -a TTTATTCGGATTTAT"
-a2 " AGCTAGCG -a TCTCTTATAT -a TTTCGGATTTAT"

or also using the file:filename.fa option, like so:

-a "file:../multiple_adapters.fa"
-a2 "file:../different_adapters.fa"

Or any combination thereof. Please get the latest dev version once more, and let me know if something does not appear to be working as expected.

Stay safe!

@jts
Copy link

jts commented Sep 3, 2020

Hi @FelixKrueger,

I would like to use this feature in a project, can I request that you tag a new release and also update the conda version?

Thanks,
Jared

@FelixKrueger
Copy link
Owner

I'll try to make a release very soon (Sunday, Monday?), I think the Conda event gets triggered automatically.

@FelixKrueger
Copy link
Owner

Alright, managed to get it out just now. Enjoy, I hope it works in your hands! Felix

@jts
Copy link

jts commented Sep 4, 2020 via email

@alephreish
Copy link

alephreish commented May 16, 2022

Correct me if I'm wrong but neither one of these strategies is guaranteed to trim all adapters as cutadapt's -n argument is not exposed to the user.

@FelixKrueger
Copy link
Owner

That's correct. I suppose you could 'trick' it into using the -n option by specifying -n .. within the adapter double quotes, like so:

-a  " AGCTCCCG -a TTTCATTATAT -a TTTATTCGGATTTAT -n 3"

@mayagomez
Copy link

Hi Felix! On a similar note, if I want to trim multiple adapters from the R2 paired file, does it work to structure it the same way? Ex. -a2 " AGCTCCCG -a2 TTTCATTATAT -a2 TTTATTCGGATTTAT -n 3". In my scenario I have long polyA and polyT tails on both my R1 and R2 reads and would like to trim from both simultaneously. The follow code no longer works when I try to specify multiple sequences to trim.
trim_galore --nextseq 20 -a " AAAAAAAAAA -a TTTTTTTTTT -n 3" -a2 " AAAAAAAAAA -a2 TTTTTTTTTT -n 3" --paired $file $Secondpair --retain_unpaired -o trimmed2/ Thank you!

@FelixKrueger
Copy link
Owner

I think the solution here would be to replace the -a2 " AAAAAAAAAA -a2 TTTTTTTTTT -n 3" with an internal -a only:

OLD: -a2 " AAAAAAAAAA -a2 TTTTTTTTTT -n 3"
NEW: -a2 " AAAAAAAAAA -a TTTTTTTTTT -n 3"

The reason for this is that R1 and R2 do in fact get processed (as single-end reads) separately, which the second -a2 within quotes breaks. I hope that makes sense?

In essence, this command line should work:

trim_galore --nextseq 20 -a " AAAAAAAAAA -a TTTTTTTTTT -n 3" -a2 " AAAAAAAAAA -a TTTTTTTTTT -n 3" --paired $file $Secondpair --retain_unpaired -o trimmed2/

@mayagomez
Copy link

@FelixKrueger That worked! Thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants