Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing fullread bam with unpaired data #21

Open
pbasting opened this issue Dec 22, 2020 · 0 comments
Open

Missing fullread bam with unpaired data #21

pbasting opened this issue Dec 22, 2020 · 0 comments

Comments

@pbasting
Copy link

Hello,

I'm currently developing the McClintock meta-pipeline for TE detection software, and I've been working on integrating RelocaTE2 as a component method. Right now, RelocaTE2 is running well on paired-end data but I've run into an issue when trying to run it with unpaired data. It seems that if no paired-end data is detected, the fullread bam file will never be produced by relocaTE_align.py causing downstream steps to fail.

Here is how I've replicated the error with the test data

# install
conda create --name relocate2 -c bioconda relocate2=2.0.1=pypl526_4
conda activate relocate2

# get test data
git clone https://github.com/JinfengChen/RelocaTE2.git

# make an "unpaired" fastq directory
test_dir="RelocaTE2/test_data"
mkdir unpaired
cp $test_dir/MSU7.Chr3_2M.ALL_reads/MSU7.Chr3_2M.ALL_reads_6X_100_500_1.fq.gz unpaired/MSU7.Chr3_2M.unPaired.fq.gz

# run relocaTE2 with only unpaired data
relocaTE2.py --te_fasta $test_dir/RiceTE.fa \
    --genome_fasta $test_dir/MSU7.Chr3_2M.fa \
    --fq_dir unpaired \
    --outdir out \
    --reference_ins $test_dir/MSU7.Chr3_2M.fa.RepeatMasker.out \
    --sample rice \
    --size 500 \
    --step 1234567 \
    --mismatch 2 \
    --cpu 20 \
    --aligner blat \
    --verbose 4

stderr & stdout

/scratch/pjb68507/test/relocate2/unpaired
fastq
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_2/0.fq2fa.sh
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_3/0.te_repeat.blat.sh
/scratch/pjb68507/test/relocate2/out/repeat/flanking_seq/MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.fq
all unpaired: /scratch/pjb68507/test/relocate2/out/repeat/flanking_seq/MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.fq
testing if bam exists: /scratch/pjb68507/test/relocate2/out/repeat/bwa_aln/MSU7.Chr3_2M.MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.fq_1.te_repeat.flankingReads.bwa.mates.bam
pre: /scratch/pjb68507/test/relocate2/out/repeat/flanking_seq/MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.fq
bam not exists, preceed with bwa to map the reads
[bwa_aln_core] convert to sequence coordinate... 0.00 sec
[bwa_aln_core] refine gapped alignments... 0.02 sec
[bwa_aln_core] print alignments... 0.02 sec
[bwa_aln_core] 16108 sequences have been processed.
[main] Version: 0.6.2-r126
[main] CMD: /home/pjb68507/software/miniconda/envs/relocate2/bin/bwa samse /scratch/pjb68507/test/relocate2/RelocaTE2/test_data/MSU7.Chr3_2M.fa /scratch/pjb68507/test/relocate2/out/repeat/bwa_aln/MSU7.Chr3_2M.MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.bwa.single.sai /scratch/pjb68507/test/relocate2/out/repeat/flanking_seq/MSU7.Chr3_2M.unPaired.te_repeat.flankingReads.fq
[main] Real time: 0.325 sec; CPU: 0.075 sec
mergeing bam file: 1/1 files
mergeing fullread bam file: 0/0 files
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_4/step_4.MSU7.Chr3_2M.repeat.align.sh
Step5: Find non-reference insertions
find insertions on Chr3
fullread bam: /scratch/pjb68507/test/relocate2/out/repeat/bwa_aln/MSU7.Chr3_2M.repeat.fullreads.bwa.sorted.bam
Traceback (most recent call last):
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_insertionFinder.py", line 1825, in <module>
    main()
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_insertionFinder.py", line 1809, in main
    read_junction_reads_align(align_file_f, read_repeat, teJunctionReads)
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_insertionFinder.py", line 1648, in read_junction_reads_align
    fsam = pysam.AlignmentFile(align_file_f, 'rb')
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `/scratch/pjb68507/test/relocate2/out/repeat/bwa_aln/MSU7.Chr3_2M.repeat.fullreads.bwa.sorted.bam` not found
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_5/0.repeat.findSites.sh
Step6: Find reference insertions
Traceback (most recent call last):
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_absenceFinder.py", line 967, in <module>
    main()
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_absenceFinder.py", line 963, in main
    write_output(top_dir, result, read_repeat, usr_target, exper, TE, required_reads, required_left_reads, required_right_reads, teInsertions, teInsertions_reads, teSupportingReads, existingTE_intact, existingTE_found, teReadClusters, bedtools, lib_size, existingTE_intact_id)
  File "/home/pjb68507/software/miniconda/envs/relocate2/bin/relocaTE_absenceFinder.py", line 146, in write_output
    REF     = open ('%s/%s.%s.all_ref_insert.txt' %(result, usr_target, TE), 'w')
IOError: [Errno 2] No such file or directory: '/scratch/pjb68507/test/relocate2/out/repeat/results/Chr3.repeat.all_ref_insert.txt'
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_6/0.repeat.absence.sh
/scratch/pjb68507/test/relocate2/out/shellscripts/step_7/0.repeat.characterize.sh: line 1: /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_nonref_insert.gff: No such file or directory
/scratch/pjb68507/test/relocate2/out/shellscripts/step_7/0.repeat.characterize.sh: line 2: /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_nonref_insert.txt: No such file or directory
cat: /scratch/pjb68507/test/relocate2/out/repeat/results/*.all_nonref_insert.txt: No such file or directory
sh: /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_nonref_insert.overlap: No such file or directory
/scratch/pjb68507/test/relocate2/out/shellscripts/step_7/0.repeat.characterize.sh: line 4: /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_ref_insert.txt: No such file or directory
/scratch/pjb68507/test/relocate2/out/shellscripts/step_7/0.repeat.characterize.sh: line 5: /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_ref_insert.gff: No such file or directory
cannot open /scratch/pjb68507/test/relocate2/out/repeat/results/ALL.all_nonref_insert.txt No such file or directory
job: sh /scratch/pjb68507/test/relocate2/out/shellscripts/step_7/0.repeat.characterize.sh
rm: cannot remove/scratch/pjb68507/test/relocate2/out/repeat/fastq_split’: No such file or directory
job: sh /scratch/pjb68507/test/relocate2/out/clean_intermediate_files.sh
  • The first exception raised is the IOError: file '/scratch/pjb68507/test/relocate2/out/repeat/bwa_aln/MSU7.Chr3_2M.repeat.fullreads.bwa.sorted.bam' not found and I'm assuming the remaining exceptions are a consequence of relocaTE_insertionFinder.py failing to run due to this missing file.
  • When looking at relocaTE_align.py I think that, in the function that is supposed to produce the fullread bam, if no paired end data is found, it will not actually produce the fullread bam but just the single.bam
    else:
    #paired not provided or not found, map reads as unpaired
    fastq = flanking_fq_list[0]
    fq_name = os.path.splitext(os.path.split(fastq)[1])[0]
    bwa_run(path, genome_file, fastq, fq_name, target, 'single', bwa, samtools)
    out_files.append('%s/bwa_aln/%s.%s.bwa.single.bam' %(path, target, fq_name))
  • I've noticed this problem mentioned in a few other issues ( cant find file xxxx.repeat.fullreads.bwa.sorted.bam #19 NO FULLREADS FILES PRODUCED #17 error message #14 IOError  #9) and I think it's likely that the cause is that they are running RelocaTE2 without paired data (either intentionally like me, or unintentionally due to improper naming of fastq files).
  • I am not sure if it is intended for RelocaTE2 to require paired end data but I assumed that, due to the functionality of the original RelocaTE and the presence of the --unpaired_id flag, that RelocaTE2 could work with both paired and unpaired data.

I appreciate any help you can provide on this issue.

Thanks,

Preston

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant